This is an ever-evolving set of lecture notes for Introduction to Stochastic Processes (M362M). It should start with me explaining what stochastic processes are. Instead, here is a list of several questions you will be able to give answers to when you complete this course.
Question 1. In a simplistic model, the price of a share of a stock goes either up or down by \(\$1\) each day, with probability \(1/2\) each. You own a single share whose value today is \(\$100\), so that tomorrow’s price will be \(\$101\) or \(\$99\), each with probability \(1/2\), etc. Your strategy is to hold onto your share until one of the following two things happens: you go bankrupt (the stock price hits \(0\)), or you make a \(\$50\) profit (the stock price hits \(\$150\)).
Question 2. A person carrying a certain disease infects a random number of people in a week, and then stops being infectious. Each of the infected people transmits the disease in the same way, etc. Suppose that the number of people each (infectious) individual infects is \(0\), \(1\), \(2\) or \(3\), each with probability \(1/4\), and that different infectious individuals may infect different numbers of people and behave independently of each other.
Question 3. In a game of tennis, Player \(1\) wins against Player \(2\) in each rally (the smallest chunk of the match that leads to a point, i.e., to a score change from \(15-30\) to \(30-30\), for example) with probability \(p\). What is the probability that Player \(1\) wins
Question 4. A knight starts in the lower left corner of the chess board and starts moving “randomly”. That means that from any position, it chooses one of the possible (legal) moves and takes it, with all legal moves having the same probability. It keeps doing the same thing until it comes back to the square it started from.
Question 5. How does Google search work?
Learning basic R is an important part of this course, and the first order of business is to download and install an R distribution on your personal computer. We will be using RStudio as an IDE (integrated development environment). Like R itself, it is free and readily available for all major platforms. To download R to your computer, go to https://cran.rstudio.com and download the version of R for your operating system (Windows, Mac or Linux). If you are on a Mac, you want the “Latest release” which, at the time of writing, is 4.3.2, with code name “Eye Holes”. On Windows, follow the link “install R for the first time”. We are not going to do any cutting edge stuff in this class, so an older release should be fine, too, if you happen to have it already installed on your system. Once you download the installation file (.pkg on a Mac or .exe on Windows), run it and follow instructions. If you are running Linux, you don’t need me to tell you what to do. Once it is successfully installed, don’t run the installed app. We will use RStudio for that.
To install RStudio, go to https://posit.co/download/rstudio-desktop/, and follow the instructions under “2: Install RStudio”. After you download and install it, you are ready to run it. When it opens, you will see something like this
The part on the left is called the console and that is (one
of the places) where you enter commands. Before you do, it is important
to adjust a few settings. Open the options window by navigating to
Tools->Global Options. In there, uncheck “Restore .RData into
workspace on startup” and set “Save workspace to .RData on exit” to
“Never”, as shown below:
This way, R will not pollute your environment with values you defined two weeks ago and completely forgot about. These settings are really an atavism and serve no purpose (for users like us) other than to introduce hard-to-track bugs.
There are many other settings you can play with in RStudio, but the two I mentioned above are the only ones that I really recommend setting as soon as you install it.
Finally, we need to install several R packages we will be using (mostly implicitly) during the class. First, run the following command in your console
install.packages("tidyverse")
If R asks “Do you want to install from sources the packages which need compilation? (Yes/no/cancel)” answer no.
This will install a number of useful packages and should only take
about a minute or two. The next part is a bit longer, and can take up to
15 minutes if you have a slow computer/internet connection. You only
have to do it once, though. Skip both steps involving
tinytex below if you have LaTeX already installed on your
system.
Start with
install.packages("tinytex")
followed by
tinytex::install_tinytex()
Note that if you go to the top right corner of each of the code blocks (gray boxes) containing instructions above, an icon will appear. If you click on it, it will copy the content of the box into your clipboard, and you can simply paste it into RStudio. You can do that with any code block in these notes.
Once R and RStudio are on your computer, it is time to get acquainted with the basics of R. This class is not about the finer points of R itself, and I will try to make your R experience as smooth as possible. After all, R is a tool that will help us explore and understand stochastic processes. Having said that, it is important to realize that R is a powerful programming language specifically created for statistical and probabilistic applications. Some knowledge of R is a valuable skill to have in today’s job market, and you should take this opportunity to learn it. The best way, of course, is by using it, but before you start, you need to know the very basics. Don’t worry, R is very user friendly and easy to get started in. In addition, it has been around for a long time (its predecessor S appeared in 1976) and is extremely well documented - google introduction to R or a similar phrase, and you will get lots of useful hits.
My plan is to give you a bare minimum in the next few paragraphs, and then to explain additional R concepts as we need them. This way, you will not be overwhelmed right from the start, and you will get a bit of a mathematical context as you learn more. Conversely, learning R commands will help with the math, too.
There are at least three different ways of inputting commands into R - through the console, scripts and R notebooks.
The console, as I already mentioned, is a window in
RStudio where you can enter your R commands one by one. As a command is
entered (and enter pressed) R will run it and display the result below.
A typical console session looks like this
If you define a variable in a command, it will be available in all the
subsequent commands. This way of interacting with R is perfect for
quick-and-dirty computations and, what is somewhat euphemistically
called “prototyping”. In other words, this way you are using R as a
calculator. There is another reason why you might be using the console.
It is perfect for package installation and for help-related commands. If
you type
help('log'), the output will appear in the
Help pane on the right. You can also see all the available
variables in the Environment pane on the (top) right.
As your needs increase, you will need more complex (and longer) code
to meet them. This is where scripts come in. They are
text files (but have the extension .R) that hold R code.
Scripts can run as a whole, and be saved for later. To create a new
script, go to File->New File->R Script. That will split your
RStudio window in two:
The top part will become a script editor, and your console will shrink
to occupy the bottom part. You can write your code in there, edit and
update it, and then run the whole script by clicking on Source, or
pressing the associated shortcut key.
Inspired by Python Jupyter notebooks, R notebooks
are a creature somewhere between scripts and the console, but also have
some features of their own. An R notebook is nothing other than a
specially formatted text file which contains chunks of R code
mixed with regular text. You can think of these chunks as mini scripts.
What differentiates them from scripts is that chunks can be executed
(evaluated) and the output becomes a part of the notebook:
R notebooks are R’s implementation of literate programming. The
idea is that documentation should be written at the same time as the
program itself. As far as this course is concerned, R notebooks are just
the right medium for homework and exam submission. You can run code and
provide the interpretation of its output in a single document. See here for more
information.
Each chapter in these lecture notes is an R notebook!
The most important thing about learning R (and many
other things, for that matter) is knowing whom (and how) to ask for
help. Luckily, R is a well established language, and you can get a lot
of information by simply googling your problem. For example, if you
google logarithm in R the top hit (at the time of writing)
gives a nice overview and some examples.
Another way to get information about a command or a concept in R is
to use the command help. For example, if you input
help("log") or ?log in your console, the right
hand of your screen will display information on the function
log and some of its cousins. Almost every help entry has
examples at the bottom, and that is where I always go first.
Objects we will be manipulating in this class are almost exclusively vectors and matrices. The simplest vectors are those that have a single component, in other words, numbers. In R, you can assign a number to a variable using two different notations. Both
a <- 1
and
a = 1
will assign the value \(1\) to the
variable a. If you want to create a longer vector, you can
use the concatenation operator c as
follows:
x = c(1, 2, 3, 4)
Once you evaluate the above in your console, the value of
x is stored and you can access it by using the command
print
print(x)
## [1] 1 2 3 4
or simply evaluating x itself:
x
## [1] 1 2 3 4
Unlike all code blocks above them, the last two contain both input
and output. It is standard not to mark the input by any symbol (like
the usual >), and to mark the output by ##
which otherwise marks comments. This way, you can copy any code block
from these notes and paste it into the console (or your script) without
having to modify it in any way. Try it!
We built the vector x above by concatenating four
numbers (vectors of length 1). You can concatenate vectors of different
sizes, too:
a = c(1, 2, 3)
b = c(4, 5, 6)
(x = c(a, b, 7))
## [1] 1 2 3 4 5 6 7
You may be wondering why I put x = c(a,b,7) in
parentheses. Without them, x would still become
(1,2,3,4,5,6,7), but its value would not be printed out. A statement in
parentheses is not only evaluated, but its result is also printed out.
This way, (x = 2+3) is equivalent to x = 2+3
followed by x or print(x).
Vectors can contain things other than numbers. Strings, for example:
(x = c("Picard", "Data", "Geordi"))
## [1] "Picard" "Data" "Geordi"
If you need a vector consisting of consecutive numbers, use the colon
: notation:
1:10
## [1] 1 2 3 4 5 6 7 8 9 10
For sequences of equally spaced numbers, use the command
seq (check its help for details)
seq(from = 5, to = 20, by = 3)
## [1] 5 8 11 14 17 20
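Two more constructions in the same spirit, shown here as a small sketch. The argument length.out and the function rep are standard parts of base R, but they are not used elsewhere in this section, so treat them as optional extras: seq can be told how many points to produce instead of the step size, and rep repeats a vector a given number of times.

```r
# 5 equally spaced numbers from 0 to 1
seq(from = 0, to = 1, length.out = 5)
## [1] 0.00 0.25 0.50 0.75 1.00
# repeat the vector (1, 2) three times
rep(c(1, 2), times = 3)
## [1] 1 2 1 2 1 2
```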
An important feature of R is that many of its functions are vectorized. That means that if you give such a function a vector as an argument, the returned value will be a vector of results of that operation performed element by element. For example
x = c(10, 20, 30)
y = c(2, 4, 5)
x + y
## [1] 12 24 35
x * y
## [1] 20 80 150
x^2
## [1] 100 400 900
cos(x)
## [1] -0.8390715 0.4080821 0.1542514
The vectors do not need to be of the same size. R uses the recycling rule - it recycles the values of the shorter one, starting from the beginning, until its size matches the longer one:
x = c(10, 20, 30, 40, 50, 60)
y = c(1, 3)
x + y
## [1] 11 23 31 43 51 63
The case where the shorter vector is of length 1 is particularly useful:
x = c(10, 20, 30, 40)
x + 1
## [1] 11 21 31 41
x * (-2)
## [1] -20 -40 -60 -80
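One caveat about recycling, sketched below with made-up numbers: when the length of the longer vector is not a multiple of the length of the shorter one, R still recycles, but it also prints a warning about the mismatched lengths. Such a warning is usually a sign of a bug.

```r
x = c(10, 20, 30, 40, 50)
y = c(1, 3)
# y is recycled as 1, 3, 1, 3, 1; since 5 is not a multiple of 2,
# R also warns that the longer length is not a multiple of the shorter one
x + y
## [1] 11 23 31 43 51
```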
Extracting parts of the vector is accomplished by using the
indexing operator []. Here are some
examples (what do negative numbers do?)
x = c(10, 20, 30, 40, 50)
x[1]
## [1] 10
x[c(1, 2)]
## [1] 10 20
x[-1]
## [1] 20 30 40 50
x[-c(3, 4)]
## [1] 10 20 50
x[1:4]
## [1] 10 20 30 40
x[c(1, 1, 2, 2, 5, 4)]
## [1] 10 10 20 20 50 40
People familiar with Python should be aware of the following two differences: 1. indexing starts at 1 and not 0, and 2. negative indexing removes components; it does not start counting from the end!
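Since negative indices cannot be used to count from the end, here is a short sketch of how one might get at the last components instead. It uses the base R functions length and tail, neither of which is discussed in these notes.

```r
x = c(10, 20, 30, 40, 50)
# the last component: index by the length of the vector
x[length(x)]
## [1] 50
# the last two components, using the built-in function tail
tail(x, 2)
## [1] 40 50
# everything except the last component (negative index = removal)
x[-length(x)]
## [1] 10 20 30 40
```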
It is important to note that the thing you put inside []
needs to be a vector itself. The above examples all dealt with numerical
indices, but you can use logical indices, too. A variable is said to be
logical or Boolean if it can take only
one of the two values TRUE or FALSE. Vectors
whose components are all logical are called, of course, logical
vectors. You can think of logical indexing as the operation where you go
through your original vector, and choose which components you want to
keep (TRUE) and which you want to throw away
(FALSE). For example
x = c(10, 20, 30, 40, 50)
y = c(TRUE, FALSE, FALSE, TRUE, TRUE)
x[y]
## [1] 10 40 50
This is especially useful when used together with the
comparison operators. The expressions like
x < y or x == y are operators in R, just like
x + y or x / y. The difference is that
< and == return logical values. For
example
1 == 2
## [1] FALSE
3 > 4
## [1] FALSE
3 >= 2
## [1] TRUE
These operators are vectorized, so you can do things like this
x = c(1, 2, 3, 4, 5)
y = c(1, 3, 3, 2, 5)
x == y
## [1] TRUE FALSE TRUE FALSE TRUE
or, using recycling,
x = c(1, 2, 3, 4, 5)
x > 3
## [1] FALSE FALSE FALSE TRUE TRUE
Let’s combine that with indexing. Suppose that we want to keep only
the values greater than 4 in the vector x. The vector
y = ( x > 4 ) is going to be of the same length as
x and contain logical values. When we index x
using it, only the values of x on positions where
x > 4 will survive, and these are exactly the values we
needed:
x = c(3, 2, 5, 3, 1, 5, 6, 4)
y = (x > 4)
x[y]
## [1] 5 5 6
or, simply,
x[x > 4]
## [1] 5 5 6
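Logical vectors are also handy for counting. The sketch below uses three base R functions (which, sum and mean) that are not introduced in the text above; it relies on the fact that R treats TRUE as 1 and FALSE as 0 in arithmetic.

```r
x = c(3, 2, 5, 3, 1, 5, 6, 4)
# positions (not values) of the components greater than 4
which(x > 4)
## [1] 3 6 7
# TRUE counts as 1 and FALSE as 0, so sum counts the TRUEs ...
sum(x > 4)
## [1] 3
# ... and mean gives the proportion of TRUEs
mean(x > 4)
## [1] 0.375
```

The last idiom, mean applied to a logical vector, is a natural way to estimate a probability from simulated values.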
Indexing can be used to set the values of a vector just as easily
x = c(10, 20, 30, 40, 50)
x[2:4] = c(0, 1, 2)
x
## [1] 10 0 1 2 50
Recycling rules apply in the same way as above
x = c(10, 20, 30, 40, 50)
x[c(1, 2, 5)] = 7
x
## [1] 7 7 30 40 7
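Logical indexing works on the left-hand side of an assignment, too. As a small illustration (the numbers are made up), here is one way to replace all negative components of a vector by zero:

```r
x = c(3, -2, 5, -3, 1)
# x < 0 selects the negative components; the single 0 is recycled
x[x < 0] = 0
x
## [1] 3 0 5 0 1
```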
A matrix in R can be created using the command matrix.
The unusual part is that the input is a vector and R populates the
components of the matrix by filling it in column by column or row by
row. As always, an example will make this clear
x = c(1, 2, 3, 4, 5, 6)
(A = matrix(x, nrow = 2, ncol = 3, byrow = TRUE))
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
The first argument of the function matrix is the vector
which contains all the values. If you want a matrix with \(m\) rows and \(n\)
columns, this vector should be of size \(mn\). The arguments ncol and nrow are
self-explanatory, and byrow is a logical argument which
signals whether to fill by columns or by rows. Here is what happens when
we set byrow = FALSE
x = c(1, 2, 3, 4, 5, 6)
(A = matrix(x, nrow = 2, ncol = 3, byrow = FALSE))
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
Accessing components of a matrix is as intuitive as it gets
(A = matrix(c(1, -1, 7, 2), nrow = 2, ncol = 2))
## [,1] [,2]
## [1,] 1 7
## [2,] -1 2
A[1, 2]
## [1] 7
Note that I did not use the argument byrow at all. In
such cases, R always uses the default value (documented in the
function’s help). For matrix the default value of
byrow is FALSE, i.e., it fills the matrix
column by column. This is not what we usually want because we tend to
think of matrices as composed of rows. Moral: do not forget
byrow = TRUE if that is what you, indeed, want.
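Entire rows, columns and submatrices can be extracted by leaving one of the two indices empty or by using a vector as an index. The following sketch uses the base R function dim, which is not covered above, to check the size of the matrix first.

```r
A = matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3, byrow = TRUE)
dim(A)        # the number of rows and the number of columns
## [1] 2 3
A[2, ]        # the entire second row (the column index is left empty)
## [1] 4 5 6
A[, 3]        # the entire third column
## [1] 3 6
A[, c(1, 3)]  # the submatrix made of columns 1 and 3
##      [,1] [,2]
## [1,]    1    3
## [2,]    4    6
```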
Usual matrix operations can be performed in R in the obvious way
(A = matrix(c(1, -1, 7, 2), nrow = 2, ncol = 2))
## [,1] [,2]
## [1,] 1 7
## [2,] -1 2
(B = matrix(c(2, 2, -3, -4), nrow = 2, ncol = 2))
## [,1] [,2]
## [1,] 2 -3
## [2,] 2 -4
A + B
## [,1] [,2]
## [1,] 3 4
## [2,] 1 -2
You should be careful with matrix multiplication. The naive operator
* yields a matrix, but probably not the one you want (what
does * do?)
(A = matrix(c(1, 2, 0, 1), nrow = 2, ncol = 2))
## [,1] [,2]
## [1,] 1 0
## [2,] 2 1
(B = matrix(c(3, 5, 1, 0), nrow = 2, ncol = 2))
## [,1] [,2]
## [1,] 3 1
## [2,] 5 0
A * B
## [,1] [,2]
## [1,] 3 0
## [2,] 10 0
If you want the matrix product, you have to use %*%
A %*% B
## [,1] [,2]
## [1,] 3 1
## [2,] 11 2
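A few other matrix operations you may know from linear algebra are available as base R functions. The sketch below (with the helper name A_inv chosen only for this illustration) shows the transpose t, the inverse solve and the identity matrix diag; none of them is needed for the rest of this chapter.

```r
A = matrix(c(1, 2, 0, 1), nrow = 2, ncol = 2)
t(A)              # the transpose of A
##      [,1] [,2]
## [1,]    1    2
## [2,]    0    1
A_inv = solve(A)  # the inverse of A (A must be square and invertible)
A %*% A_inv       # the 2x2 identity matrix, possibly up to rounding errors
diag(2)           # the 2x2 identity matrix, built directly
```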
The following syntax is used to define functions in R:
my_function = function(x, y, z) {
  return(x + y + z)
}
The function my_function returns the sum of its
arguments. Having defined it, as above, we can use it like this
my_function(1, 3, 9)
## [1] 13
Neither the output nor the arguments of a function in R are
restricted to numbers. Our next example function, named
winners, takes two vectors as arguments and returns a
vector. Its components are those components of the first input vector
(x) that are larger than the corresponding components of
the second input vector (y)
winners = function(x, y) {
  z = x > y
  return(x[z])
}
winners(c(1, 4, 5, 6, 2), c(2, 3, 3, 9, 2))
## [1] 4 5
Note how we used several things we learned above in this function.
First, we defined the logical vector which indicates where
x is larger than y. Then, we used logical
indexing to return only certain components of x.
Our final element of R is its if-else statement. The
syntax of the if statement is
if (condition) {
  statement
}
where condition is anything that has a logical value,
and statement is any R statement. First R evaluates
condition. If it is true, it runs statement.
If it is false, nothing happens. If you want something to happen if (and
only if) your condition is false, you need an if-else
statement:
if (condition) {
  statement1
} else {
  statement2
}
This way, statement1 is evaluated when
condition is true and statement2 when it is
false. Since conditions inside the if statement return
logical values, we can combine them using ands, ors or
nots. The R notation for these operations is &, | and !
respectively, and to remind you what they do, here is a simple table
| x | y | x & y (and) | x \| y (or) | !x (not) |
|---|---|---|---|---|
| TRUE | TRUE | TRUE | TRUE | FALSE |
| TRUE | FALSE | FALSE | TRUE | FALSE |
| FALSE | TRUE | FALSE | TRUE | TRUE |
| FALSE | FALSE | FALSE | FALSE | TRUE |
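Like the arithmetic and comparison operators, these three are vectorized, so the whole table can be reproduced in a few lines. This is only a sketch to experiment with; the vectors x and y simply list the four combinations of truth values.

```r
x = c(TRUE, TRUE, FALSE, FALSE)
y = c(TRUE, FALSE, TRUE, FALSE)
x & y   # and: TRUE only where both are TRUE
## [1]  TRUE FALSE FALSE FALSE
x | y   # or: TRUE where at least one is TRUE
## [1]  TRUE  TRUE  TRUE FALSE
!x      # not: flips TRUE and FALSE
## [1] FALSE FALSE  TRUE  TRUE
```

R also has the doubled forms && and ||, which are meant for single logical values and are often seen inside if statements.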
Let’s put what we learned about functions and if-else statements
together to write a function distance_or_zero whose
arguments are coordinates x and y of a point
in the plane, and whose output is the distance from the point (x,y) to
the origin if this distance happens to be between 1 and 2, and 0
otherwise. We will use similar functions later when we discuss Monte
Carlo methods:
distance_or_zero = function(x, y) {
  distance = sqrt(x^2 + y^2)
  if (distance <= 2 & distance >= 1) {
    return(distance)
  } else {
    return(0)
  }
}
distance_or_zero(1.2, 1.6)
## [1] 2
distance_or_zero(2, 3)
## [1] 0
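The function above expects single numbers as arguments, because the condition of an if statement must be a single logical value. A vectorized variant, sketched here under the assumption that we want the same rule applied component by component, can be written with the base R function ifelse. The name distance_or_zero_vec is made up for this illustration.

```r
distance_or_zero_vec = function(x, y) {
  distance = sqrt(x^2 + y^2)
  # ifelse(test, yes, no) picks, component by component,
  # from yes where test is TRUE and from no where it is FALSE
  ifelse(distance >= 1 & distance <= 2, distance, 0)
}
# three points at once: (0.3, 0.4), (1, 1) and (2, 3)
distance_or_zero_vec(c(0.3, 1, 2), c(0.4, 1, 3))
## [1] 0.000000 1.414214 0.000000
```

Only the middle point, at distance \(\sqrt{2}\) from the origin, falls between 1 and 2.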
Here are several simple problems. Their goal is to give you an idea of exactly how much R is required to get started in this course.
Compute the following (your answer should be a decimal number):
Note: some of the answers will look like this 3.14e+13.
If you do not know what that means, google scientific notation
or E notation.
Define two variables \(a\) and \(b\) with values \(3\) and \(4\) and “put” their product into a variable called \(c\). Output the value of \(c\).
Define two vectors \(x\) and \(y\) of length \(3\), such that the components of \(x\) are \(1,2,3\) and the components of \(y\) are \(8,9,0\). Output their (componentwise) sum.
Define a \(2\times 2\) matrix \(A=\begin{pmatrix} 1 & 2 \\ -1 & 3 \end{pmatrix}\).
Compute the matrix square \(A^2\).
Construct a vector \(x\) which contains all numbers from \(1\) to \(100\).
Construct a vector \(y\) which contains squares of all numbers between \(20\) and \(2000\).
Construct a vector \(z\) which contains only those components of \(y\) whose values are between \(400,000\) and \(500,000\).
Compute the average (arithmetic mean) of all the components of \(z\). There is an R function that does that for you - find it!
Write a function that takes a numerical argument \(x\) and returns \(5\) if \(x\geq 5\) and \(x\) itself otherwise.
Write a function that returns TRUE (a logical value)
if its argument is between \(2\) and
\(3\) and FALSE
otherwise.
(Extra credit) Write a function that takes two equal-sized vectors as arguments and returns the angle between them in degrees. For definiteness, the angle between two vectors is defined to be \(0\) when either one of them is \((0,0,\dots,0)\).
In the spirit of “learn by doing”, these lecture notes contain many “Problems”, both within the sections, and at the very end of each chapter. Those within sections come with solutions and usually introduce new concepts. They often feature a Comments section right after the solution subdivided into R and Math comments focusing on the computational or conceptual features, respectively. Note that you are not expected to be able to do the problems within sections before reading their solutions and comments, so don’t worry if you cannot. It is a good practice to try, though. Problems at the end, in the Additional Problems section are (initially) left unsolved. They do not require any new ideas and are there to help you practice the skills presented before.
… where we also review some probability along the way.
“Draw” 50 simulations from the geometric distribution with parameter \(p=0.4\).
rgeom(50, prob = 0.4)
## [1] 1 0 3 4 1 2 0 0 2 2 0 1 5 0 1 0 2 1 1 0 2 2 2 1 0 0 1 3 2 2 1 1 1 3 5 0 1 1
## [39] 0 0 0 1 2 0 1 1 1 0 1 0
Comments (R): R makes it very easy to simulate draws from a
large class of named distributions, such as geometric,
binomial, uniform, normal, etc. For a list of all available
distributions, run help("distributions"). Each available
distribution has an R name; the uniform is unif,
the normal is norm, the binomial is binom,
etc. If you want to simulate \(n\)
draws (aka a sample of size \(n\)) from a distribution, you form a full
command by prepending the letter r to its R name and use
\(n\) as an argument. That is how we
arrived at rgeom(50, prob = 0.4) in the solution above. The additional
arguments of the function rgeom have to do with the
parameters of that distribution. Which parameters go with which
distributions, and how to input them as arguments to rgeom
or rnorm is best looked up in R’s extensive documentation.
Try help("rnorm"), for example.
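To make the naming pattern concrete, here are a few more simulation commands built the same way. Treat this as a sketch: the parameter values are arbitrary, and set.seed (a base R function not used in the notes so far) is only there to make the draws reproducible from run to run.

```r
set.seed(1)                        # fix the state of the random number generator (optional)
rbinom(5, size = 10, prob = 0.3)   # 5 draws from the binomial with n = 10 and p = 0.3
runif(5, min = 0, max = 2)         # 5 draws from the uniform on [0, 2]
rnorm(5, mean = 1, sd = 2)         # 5 draws from the normal with mean 1 and sd 2
rexp(5, rate = 3)                  # 5 draws from the exponential with rate 3
```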
Comments (Math): You could spend your whole life trying to understand what it really means to “simulate” or “generate” a random number. The numbers you obtain from so-called random number generators (RNG) are never random. In fact, they are completely deterministically generated. Still, sequences of numbers obtained from (good) random number generators share so many properties with sequences of mythical truly random numbers, that we can use them as if they were truly random. For the purposes of this class, you can assume that the numbers R gives you as random are random enough. Random number generation is a fascinating topic at the intersection of number theory, probability, statistics, computer science and even philosophy, but we do not have the time to cover any of it in this class. If you want to read a story about a particularly bad random number generator, go here.
You might have encountered a geometric distribution before. A random variable with that distribution can take any positive integer value or \(0\), i.e., its support is \({\mathbb N}_0=\{0,1,2,3,\dots\}\). As you can see from the output above, the value \(0\) appears more often than the value \(3\), and the value \(23\) does not appear at all in this particular simulation run. The probability of seeing the value \(k\in \{0,1,2,3,\dots\}\) as a result of a single draw is given by \((1-p)^k p\), where \(p\) is called the parameter of the distribution.
That corresponds to the following interpretation of the geometric distribution: keep tossing a biased coin (with probability \(p\) of obtaining H) until you see the first H; the number of Ts before that is the value of your geometric random variable. If we put these probabilities in a single table (and choose \(p=0.4\), for example) it is going to look like this:
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | … |
|---|---|---|---|---|---|---|---|---|---|
| Prob. | 0.4 | 0.24 | 0.144 | 0.086 | 0.052 | 0.031 | 0.019 | 0.011 | … |
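The entries of this table can be recomputed in R, either straight from the formula \((1-p)^k p\) or with the built-in function dgeom, which returns the pmf of the geometric distribution. The rounding to three decimals is my choice, made to match the table.

```r
p = 0.4
k = 0:7
# the pmf from the formula (1-p)^k p ...
round((1 - p)^k * p, 3)
## [1] 0.400 0.240 0.144 0.086 0.052 0.031 0.019 0.011
# ... and from R's built-in pmf of the geometric distribution
round(dgeom(k, prob = p), 3)
## [1] 0.400 0.240 0.144 0.086 0.052 0.031 0.019 0.011
```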
Of course, the possible values our random variable can take do not
stop at \(7\). In fact, there are
infinitely many possible values, but we do not have infinite space. Note
that even though the value \(23\) does
not appear in the output of the command rgeom above, it
probably would if we simulated many more than \(50\) values. Let’s try it with \(500\) draws - the table below counts how
many \(0\)s, \(1\)s, \(2\)s, etc. we got:
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 208 | 132 | 62 | 43 | 23 | 16 | 8 | 3 | 2 | 1 | 2 |
Still no luck, but we do observe values above 5 more often. By trial and error, we arrive at about \(1,000,000\) as the required number of simulations:
| 0 | 1 | 2 | 3 | … | 23 | 24 | 25 | 26 |
|---|---|---|---|---|---|---|---|---|
| 400616 | 238946 | 144274 | 86489 | … | 3 | 3 | 3 | 3 |
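Count tables like the ones above can be produced with the base R function table. The sketch below is one possible way to do it, not necessarily the code used to generate the tables in these notes; the seed is arbitrary, and your counts will differ from the ones shown above.

```r
set.seed(2024)                   # arbitrary seed, only for reproducibility
draws = rgeom(500, prob = 0.4)
table(draws)                     # how many 0s, 1s, 2s, etc. we got
max(draws)                       # the largest value observed
any(draws > 22)                  # did we see anything greater than 22?
```

Replacing 500 by 1000000 gives the larger experiment; it still runs in well under a minute on a typical computer.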
Compute the probability that among \(1,000,000\) draws of a geometric random variable with parameter \(p=0.4\), we never see a number greater than \(22\).
First, we compute the probability that the value seen in a single draw does not exceed \(22\):
pgeom(22, prob = 0.4)
## [1] 0.9999921
Different draws are independent of each other, so we need to raise this to the power \(1,000,000\).
(pgeom(22, prob = 0.4))^(1000000)
## [1] 0.0003717335
Comments (R): The command we used here is pgeom
which is a cousin of rgeom. In general, R commands that
involve named probability distributions consist of two parts. The
prefix, i.e., the initial letter (p in this case) stands
for the operation you want to perform, and the rest is the R name of the
distribution. There are 4 prefixes, and the commands they produce
are
| Prefix | Description |
|---|---|
| r | Simulate random draws from the distribution. |
| p | Compute the cumulative probability distribution function (cdf) (NOT pdf) |
| d | Compute the probability density (pdf) or the probability mass function (pmf) |
| q | Compute the quantile function |
(see the Math section below for the reminder of what these things
are). In this problem, we are dealing with a geometric random variable
\(X\), which has a discrete
distribution with support \(0,1,2,3,\dots\). Therefore, the R name is
geom. We are interested in the probability \({\mathbb{P}}[ X\leq 22]\), which
corresponds to the cdf of \(X\) at
\(x=22\), so we use the prefix
p. Finally, we used the named parameter prob and
gave it the value prob = 0.4, because the geometric
distribution has a single parameter \(p\).
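Here is a quick tour of the other prefixes for the same distribution, as a sketch with hand-checkable numbers (the values \(2\) and \(0.75\) are arbitrary choices):

```r
dgeom(2, prob = 0.4)          # pmf: P[X = 2] = 0.6^2 * 0.4
## [1] 0.144
pgeom(2, prob = 0.4)          # cdf: P[X <= 2] = 0.4 + 0.24 + 0.144
## [1] 0.784
sum(dgeom(0:2, prob = 0.4))   # the cdf is the running sum of the pmf
## [1] 0.784
qgeom(0.75, prob = 0.4)       # quantile: the smallest k with P[X <= k] >= 0.75
## [1] 2
```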
This problem also gives us a chance to discuss precision. As you can see, the probability of a single draw not exceeding \(22\) is very close to \(1\). In fact, it differs from \(1\) by less than \(10^{-5}\). By default, R displays 7 significant digits of a number. That is enough for most applications (and barely enough for this one), but sometimes we need more. For example, let’s try to compute the probability of seeing at least one H (heads) in 10 tosses of a biased coin, where the probability of H (heads) is 0.9.
1 - 0.1^10
## [1] 1
While very close to it, this probability is clearly not equal to
\(1\), as suggested by the output
above. The culprit is the default precision. We can increase the
precision (up to \(22\) digits) using
the options command
options(digits = 17)
1 - 0.1^10
## [1] 0.99999999989999999
Precision issues like this one should not appear in this course, but they will appear out there “in the wild”, so it might be a good idea to be aware of them.
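When a probability is very close to \(1\), it often helps to work with the small complementary probability instead. The sketch below shows a few standard base R tools for that: the argument lower.tail of pgeom, the function log1p (which computes \(\log(1+x)\) accurately for small \(x\)), and the digits argument of print. The variable name upper is made up for this illustration.

```r
# the upper tail P[X > 22], computed directly instead of as 1 - P[X <= 22]
upper = pgeom(22, prob = 0.4, lower.tail = FALSE)
upper
# it agrees with the formula (1-p)^23 for p = 0.4
0.6^23
# the probability of never exceeding 22 in 1,000,000 draws, via logarithms
exp(1000000 * log1p(-upper))
# print more digits of a single number without changing the global option
print(1 - 0.1^10, digits = 15)
```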
Comments (Math): If you forgot all about pdfs, cdfs and such things here is a little reminder:
| cdf | \(F(x) = {\mathbb{P}}[X\leq x]\) |
| pdf | \(f(x)\) such that \({\mathbb{P}}[X \in [a,b]] = \int_a^b f(x) \, dx\) for all \(a<b\) |
| pmf | \(p(x)\) such that \({\mathbb{P}}[X=a_n] = p(a_n)\) for some sequence \(a_n\) |
| qf | \(q(p)\) is a number such that \({\mathbb{P}}[ X \leq q(p)] = p\) |
Those random variables that admit a pdf are called continuous. The prime examples are the normal and the exponential distributions. The ones where a pmf exists are called discrete. The sequence \(a_n\) covers all values that such a discrete random variable can take. Most often, \(a_n\) covers either the set of all natural numbers \(0,1,2,\dots\) or a finite subset such as \(1,2,3,4,5,6\).
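The relationship between the pdf and the cdf of a continuous distribution can be checked numerically. In the sketch below, the interval \([0,2]\) and the parameters of the normal are arbitrary choices, and integrate is base R's numerical integration routine, which is not used elsewhere in these notes.

```r
# P[0 <= X <= 2] for a normal X with mean 1 and sd 2, computed in two ways:
# 1. by integrating the pdf numerically over [0, 2]
a = integrate(dnorm, lower = 0, upper = 2, mean = 1, sd = 2)$value
# 2. by taking the difference of two values of the cdf
b = pnorm(2, mean = 1, sd = 2) - pnorm(0, mean = 1, sd = 2)
c(a, b)   # both are approximately 0.383
```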
Coming back to our original problem, we note that the probability we obtained is quite small. Since \(1/0.000372\) is about \(2690\), we would have to run about \(2690\) rounds of \(1,000,000\) simulations before the largest number falls below \(23\).
Compute the \(0.05\), \(0.1\), \(0.4\), \(0.6\) and \(0.95\) quantiles of the normal distribution with mean \(1\) and standard deviation \(2\).
qnorm(c(0.05, 0.1, 0.4, 0.6, 0.95), mean = 1, sd = 2)
## [1] -2.2897073 -1.5631031 0.4933058 1.5066942 4.2897073
Comments (R): The function we used is qnorm,
with the prefix q which computes the quantile function and
the R name norm because we are looking for the quantiles of
the normal distribution. The additional (named) parameters are where the
parameters of the distribution come in (the mean and the standard
deviation, in this case). Note how we plugged in the entire vector
c(0.05, 0.1, 0.4, 0.6, 0.95) instead of a single value into
qnorm. You can do that because this function is
vectorized. That means that if you give it a vector as
an argument, it will “apply itself” to each component of the vector
separately, and return the vector of results. Many (but not all)
functions in R are vectorized.
As a sanity check, let’s apply pnorm (which computes the
cdf of the normal) to these quantile values:
p = qnorm(c(0.05, 0.1, 0.4, 0.6, 0.95), mean = 1, sd = 2)
pnorm(p, mean = 1, sd = 2)
## [1] 0.05 0.10 0.40 0.60 0.95
As expected, we got the original values back - the normal quantile function and its cdf are inverses of each other.
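For discrete distributions the picture is a little different, because the cdf has jumps. R's quantile functions return the smallest value at which the cdf reaches or exceeds the given probability, so applying the cdf to a quantile may overshoot. A small sketch with the geometric distribution from before:

```r
k = qgeom(0.5, prob = 0.4)   # the smallest k with P[X <= k] >= 0.5
k
## [1] 1
pgeom(k, prob = 0.4)         # overshoots 0.5, because the cdf jumps from 0.4 to 0.64
## [1] 0.64
```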
Comments (Math): Computing the cdf of a standard normal is the same thing as reading a normal table. Computing a quantile is the opposite; you go into the middle of the table and find your value, and then figure out which “Z” would give you that value.
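If you are used to tables, you may remember that quantiles of a general normal are obtained from the standard normal ones by scaling and shifting: if \(z\) is a quantile of the standard normal, then \(\mu + \sigma z\) is the corresponding quantile of the normal with mean \(\mu\) and standard deviation \(\sigma\). Here is a short sketch of that check (the variable names probs and z are mine):

```r
probs = c(0.05, 0.1, 0.4, 0.6, 0.95)
z = qnorm(probs)                 # quantiles of the standard normal (the "Z" values)
1 + 2 * z                        # shift by the mean 1 and scale by the sd 2
qnorm(probs, mean = 1, sd = 2)   # the same numbers, computed directly
```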
Simulate \(60\) throws of a fair \(10\)-sided die.
sample(1:10, 60, replace = TRUE)
## [1] 2 8 9 8 4 7 7 7 2 3 3 10 6 1 9 7 4 7 6 2 2 3 10 1 9
## [26] 7 3 2 8 4 1 2 8 1 4 9 1 9 10 10 6 1 8 6 1 10 5 1 6 9
## [51] 8 3 8 9 4 6 1 6 7 8
Comments (Math): Let \(X\) denote the outcome of a single throw of a fair \(10\)-sided die. The distribution of \(X\) is discrete (it can only take the values \(1,2,\dots, 10\)) but it is not one of the more famous named distributions. I guess you could call it a discrete uniform on \(\{1,2,\dots, 10\}\), but a better way to describe such a distribution is by a distribution table, which is really just a list of possible values a random variable can take, together with their respective probabilities. In this case,
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|
| 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
Comments (R): The command used to draw a sample from a
(finite) collection is, of course, sample. The first
argument is a vector, and it plays the role of the “bag” from which you
are drawing. If we are interested in repeated, random samples, we also
need to specify replace = TRUE. With replace = FALSE, you could draw
any single number at most once:
sample(1:10, 8, replace = FALSE)
## [1] 1 5 6 7 8 10 3 4
With more than 10 draws, we would run out of numbers to draw:
sample(1:10, 12, replace = FALSE)
## Error in sample.int(length(x), size, replace, prob): cannot take a sample larger than the population when 'replace = FALSE'
The bag you draw from can contain objects other than numbers:
sample(c("Picard", "Data", "Geordi"), 9, replace = TRUE)
## [1] "Picard" "Data" "Geordi" "Geordi" "Data" "Data" "Picard" "Data"
## [9] "Geordi"
So far, each object in the bag had the same probability of being drawn.
You can use the sample command to produce a
weighted sample, too. For example, if we wanted to simulate
\(10\) draws from the following
distribution
| 1 | 2 | 3 |
|---|---|---|
| 0.2 | 0.7 | 0.1 |
we would use the additional argument prob:
sample(c(1, 2, 3), 10, replace = TRUE, prob = c(0.2, 0.7, 0.1))
## [1] 1 2 2 1 1 2 2 3 2 2
Note how it is mostly \(2\)s, as expected.
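With only \(10\) draws the match with the distribution table is rough. A quick way to check that the weights are respected is to draw many more values and tabulate the relative frequencies. The sketch below assumes a sample size of \(100,000\) and an arbitrary seed, neither of which comes from the problem.

```r
set.seed(1)  # arbitrary seed, only to make the run reproducible
n = 100000
draws = sample(c(1, 2, 3), n, replace = TRUE, prob = c(0.2, 0.7, 0.1))
# relative frequency of each value; should be close to 0.2, 0.7, 0.1
freq = table(draws) / n
freq
```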
Draw a sample of size \(n=10\) from \(N(1,2)\), i.e., from the normal distribution with parameters \(\mu=1\), \(\sigma = 2\). Plot a histogram of the obtained values. Repeat for \(n=100\) and \(n=100000\).
x = rnorm(10, mean = 1, sd = 2)
hist(x)
x = rnorm(100, mean = 1, sd = 2)
hist(x)
x = rnorm(100000, mean = 1, sd = 2)
hist(x)
Comments (R): It cannot be simpler! You use the command
hist, feed it a vector of values, and it produces a
histogram. It will even label the axes for you. If you want to learn how
to tweak various features of your histogram, type
?hist.
Esthetically, the built-in histograms leave something to be desired.
We can do better, using the package ggplot2. You don’t have
to use it in this class, but if you want to, you install it first by
running install.packages("ggplot2") (you have to do this
only once). Then, every time you want to use it, you run
library(ggplot2) to notify R that you are about to use a
function from that package. It would take a whole semester to learn
everything there is to know about ggplot2; I will only show
what a histogram looks like in it:
library(ggplot2)
z = rnorm(100000, mean = 1, sd = 2)
ggplot(data=as.data.frame(z), aes(x=z))+
geom_histogram(bins=50, fill="white", color="DarkRed")
Comments (Math): Mathematically, a histogram can be produced
for any (finite) sequence of numbers: we divide the range into several
bins, count how many of the points in the sequence fall into each bin,
and then draw a bar above that bin whose height is equal (or
proportional) to that count. The picture tells us how the
sequence we started from is “distributed”. The order of the points does
not matter - you would get exactly the same picture if you sorted the
points first. If the sequence of points you draw the histogram of comes
from, say, a normal distribution, the histogram will resemble the shape of
the pdf of a normal distribution. I say resemble, because its shape is
ultimately random. If the number of points is small (like in the first
part of this problem) the histogram may look nothing like the normal
pdf. However, when the number of points gets larger and larger, the
shape of the histogram gets closer and closer to the underlying pdf (if
it exists). I keep writing “shape” because the three histograms above
have very different scales on the \(y\)-axis.
That is because we used counts to set the heights of the bars.
A more natural choice is to use the proportions, i.e., relative
frequencies (counts divided by the total number of points) for bar
heights. More precisely, the bar height \(h\) over the bin \([a,b]\) is chosen so that the area of the
bar, i.e., \((b-a)\times h\), equals
the proportion of all points that fall inside \([a,b]\). This way, the total area under the
histogram is always \(1\). To draw such
a density histogram in R we would need to add the
additional option freq = FALSE to hist:
x = rnorm(100000, mean = 1, sd = 2)
hist(x, freq = FALSE)
Note how the \(y\)-axis label changed from “Frequency” to “Density”. With such a normalization, the histogram of \(x\) can be directly compared to the probability density of a normal distribution. Here is a histogram of \(10,000\) simulations from our normal distribution with its density function (pdf) superimposed:
sims = rnorm(10000, mean = 1, sd = 2)
x = seq(-6, 8, by = 0.02)
y = dnorm(x, mean = 1, sd = 2)
hist(sims, freq = FALSE, main = "")
points(x, y, type = "l", lwd = 3, col = "red")
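The claim that the total area under a density histogram equals \(1\) can be checked numerically. The sketch below relies on the fact that hist returns the bin boundaries and bar heights (as breaks and density) when called with plot = FALSE; the variable names are only illustrative.

```r
x = rnorm(100000, mean = 1, sd = 2)
# plot = FALSE returns the bins and bar heights without drawing anything
h = hist(x, plot = FALSE)
# area of each bar = width of its bin * height of the bar
total_area = sum(diff(h$breaks) * h$density)
total_area
```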
Let x contain \(2,000\) draws from \(N(0,1)\), z another \(2,000\) draws from \(N(0,1)\) and let y=x^2+z.
1. Draw a scatterplot of x and y to visualize the joint distribution of x and y.
2. Plot two histograms, one of x and one of y. Do they tell the whole story about the joint distribution of x and y?
3. Are x and y correlated? Do x and y in your plot “look independent”? Use the permutation test to check for independence between x and y.
x = rnorm(2000)
z = rnorm(2000)
y = x^2 + z
plot(x, y)
hist(x)
hist(y)
No, the two histograms would not be enough to describe the joint distribution. There are many ways in which two random variables \(X\) and \(Y\) can be jointly distributed, but whose separate (marginal) distributions match the histograms above. To give a very simple example, let \(X\) and \(Y\) be discrete random variables, each of which can only take values \(0\) or \(1\). Consider the following two possible joint distribution tables for the random pair \((X,Y)\):
|
|
In both cases, the marginals are the same, i.e., both \(X\) and \(Y\) are equally likely to take the value \(0\) or \(1\), i.e., they both have the Bernoulli distribution with parameter \(p=1/2\). That would correspond to the separate histograms to be the same. On the other hand, their joint distributions (aka dependence structures) are completely different. In the first (left) case, \(X\) and \(Y\) are independent, but in the second they are completely dependent.
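One concrete pair of joint distributions fitting this description can be written down as matrices in R. The entries of the second matrix are an assumption: the completely dependent case is taken to be \(Y=X\) (the choice \(Y=1-X\) would work equally well).

```r
# Independent case: every pair of values has probability 1/4
joint_indep = matrix(c(0.25, 0.25,
                       0.25, 0.25), nrow = 2, byrow = TRUE,
                     dimnames = list(x = c("0", "1"), y = c("0", "1")))
# Completely dependent case (assumed here to be Y = X)
joint_dep = matrix(c(0.5, 0.0,
                     0.0, 0.5), nrow = 2, byrow = TRUE,
                   dimnames = list(x = c("0", "1"), y = c("0", "1")))
# Marginal of X = row sums, marginal of Y = column sums
rowSums(joint_indep); colSums(joint_indep)
rowSums(joint_dep);   colSums(joint_dep)
```

All four marginals come out as \((0.5, 0.5)\), even though the two matrices are different.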
They are probably not correlated since the sample correlation between
x and y is close to \(0\):
(cor(x, y))
## [1] -0.02880239
but they do not look independent.
To apply the permutation test, we first plot the scatterplot of
x vs. y as above. Then, we replace
y by a vector with the same components, but randomly
permute their positions, and then plot a scatterplot again. We repeat
this three times:
y_perm_1 = sample(y)
y_perm_2 = sample(y)
y_perm_3 = sample(y)
plot(x, y)
plot(x, y_perm_1)
plot(x, y_perm_2)
plot(x, y_perm_3)
The conclusion is clear: the first (upper-left) plot is very
different from the other three. Therefore, x and
y are probably not independent.
Comments (Math): The point of this problem is to review the notion of the joint distribution between two random variables. The most important point here is that there is more to the joint distribution of two random variables than just the two distributions taken separately. In a sense, the whole is (much) more than the sum of its parts. This is something that does not happen in the deterministic world. If you give me the \(x\)-coordinate of a point, and, separately, its \(y\)-coordinate, I will be able to pinpoint the exact location of that point.
On the other hand, suppose that the \(x\)-coordinate of a point is unknown, so we treat it as a random variable, and suppose that this variable has the standard normal distribution. Do the same for \(y\). Even with this information, you cannot say much about the position of the point \((x,y)\). It could be that the reason we are uncertain about \(x\) and the reason we are uncertain about \(y\) have nothing to do with each other; in that case we would be right to assume that \(x\) and \(y\) are independent. If, on the other hand, we got the values of both \(x\) and \(y\) by measuring them using the same, inaccurate, tape measure, we cannot assume that the errors are independent. It is more likely that both \(x\) and \(y\) are too big, or both \(x\) and \(y\) are too small.
Mathematically, we say that random variables \(X\) and \(Y\) are independent if \[{\mathbb{P}}[X \in [a,b]] \times {\mathbb{P}}[ Y \in [c,d] ] = {\mathbb{P}}[ X\in [a,b] \text{ and } Y\in [c,d]]\text{ for all } a,b,c,d.\] While to the point, this definition is
not very eye-opening, or directly applicable in most cases. Intuitively,
\(X\) and \(Y\) are independent if the distribution of
\(Y\) would not change if we received
additional information about \(X\). In
our problem, random variables \(X\) and
\(Y\) correspond to vectors
x and y. Their scatterplot above clearly
conveys the following message: when x is around \(-2\), we expect y to be around
\(4\), while when x is around \(0\), y would be expected to be
around \(0\), too.
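The product formula can also be checked numerically for one particular choice of intervals. The sketch below uses \([a,b]=[1.5,3]\) and \([c,d]=[2,10]\) and a larger number of draws than in the problem; these choices are illustrative assumptions. If x and y were independent, the two numbers printed at the end would be close to each other.

```r
set.seed(1)  # arbitrary seed, only to make the run reproducible
n = 100000
x = rnorm(n)
z = rnorm(n)
y = x^2 + z
A = (x >= 1.5) & (x <= 3)   # the event "X is in [1.5, 3]"
B = (y >= 2) & (y <= 10)    # the event "Y is in [2, 10]"
p_joint = mean(A & B)            # estimate of P[X in [a,b] and Y in [c,d]]
p_product = mean(A) * mean(B)    # estimate of P[X in [a,b]] * P[Y in [c,d]]
c(joint = p_joint, product = p_product)
```

The joint probability comes out several times larger than the product, which is consistent with what the scatterplot suggests.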
Sometimes, it is not so easy to decide whether two variables are
independent by staring at a scatterplot. What would you say about the
scatterplot below?
The permutation test is designed to help you decide
when two (simulated) random variables are likely to be independent. The
idea is simple. Suppose that
x and y are
simulations from two independent (not necessarily identical)
distributions; say x=runif(1000) and
y=rnorm(1000). The vector y_perm=sample(y) is
a randomly permuted version of y (see R section below) and
it contains exactly the same information about the distribution of
y as y itself does. Both y and
y_perm will produce exactly the same histogram. Permuting
y, however, “uncouples” it from x. If there
was any dependence between the values of x and
y before, there certainly isn’t any now. In other words, the joint
distribution of x and y_perm has the same
marginals as the joint distribution of x and
y, but all the (possible) dependence has been removed. What
remains is to compare the scatterplot between x and
y and the scatterplot between x and
y_perm. If they look about the same, we conclude that
x and y are independent. Otherwise, there is
some dependence between them.
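The statement that y_perm carries exactly the same information about the distribution of y as y itself can be verified directly: the two vectors contain the same values, only in a different order. A minimal sketch:

```r
x = runif(1000)
y = rnorm(1000)
y_perm = sample(y)   # random permutation of y
# same values, different order: sorting both gives identical vectors
identical(sort(y), sort(y_perm))
# so any summary that ignores the order agrees as well (up to rounding)
c(mean(y), mean(y_perm))
```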
One question remains: why did we have to draw three scatterplots of
permuted versions of y? That is because we have only
finitely many data points, and it can happen, by pure chance, that the
permutation we applied to y does not completely scramble
its dependence on x. With a “sample” of three such plots,
we get a better feeling for the inherent randomness in this permutation
procedure, and it is much easier to tell whether “one of these things is
not like the others”. Btw, the random variables in the scatterplot above
are, indeed, independent; here are the \(4\) permutation-test plots to “prove” it:
Unlike univariate (one-variable) distributions which are visualized
using histograms or similar plots, multivariate (several-variable)
distributions are harder to depict. The most direct relative of the
histogram is a 3d histogram. Just like the \(x\)-axis is divided into bins in the
univariate case, in the multivariate case we divide the \(xy\)-plane into regions (squares, e.g.) and
count the number of points falling into each of these regions. After
that a 3d bar (a skyscraper) is drawn above each square with the height
of each skyscraper equal (or proportional) to the number of points which
fall into its base. Here is a 3d histogram of our original pair
(x,y) from the problem. You should be able to
rotate and zoom it right here in the notes, provided your browser has
JavaScript enabled:
A visualization solution that requires less technology would start the same way, i.e., by dividing the \(xy\) plane into regions, but instead of the third dimension, it would use different colors to represent the counts. Here is an example where the regions are hexagons, as opposed to squares; it just looks better, for some reason:
Just to showcase the range of possibilities, here is another
visualization technique which requires deeper statistical tools,
namely the density contour plot:
Comments (R): There is very little new R here. You should
remember that if x and y are vectors of the
same length, plot(x,y) gives you a scatterplot of
x and y.
To compute the sample correlation between two vectors, use the
command cor.
We used the command sample(y) to obtain a randomly
permuted version of y. The simplicity of this is due to
default parameters of the command sample which we already
learned about. In particular, the default number of samples is exactly
the size of the input vector y and, by default, sampling is
performed without replacement. If you think about it for a
second, you will realize that a sample of size \(n\) from the vector of size \(n\) without replacement is nothing
but a random permutation of y.
You are not required to do this in your submissions, but if you want
to display several plots side-by-side, use the command
par(mfrow=c(m,n)) before the command plot. It
tells R to plot the next \(mn\) plots
in an \(m\times n\) grid.
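As an illustration, the four permutation-test plots from the solution can be placed in a \(2\times 2\) grid as follows. Saving the value returned by par and restoring it at the end is an optional habit, not something required in this class.

```r
x = rnorm(2000)
z = rnorm(2000)
y = x^2 + z
old_par = par(mfrow = c(2, 2))  # the next 4 plots fill a 2 x 2 grid, row by row
plot(x, y, main = "original")
plot(x, sample(y), main = "permutation 1")
plot(x, sample(y), main = "permutation 2")
plot(x, sample(y), main = "permutation 3")
par(old_par)                    # restore the previous layout
```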
Let the random variables \(X\) and \(Y\) have the joint distribution given by the following table:
| \(X \backslash Y\) | 1 | 2 | 3 |
|---|---|---|---|
| 1 | 0.1 | 0.2 | 0.3 |
| 2 | 0.2 | 0.2 | 0.0 |
Simulate \(10,000\) draws from the distribution of \((X,Y)\) and display a contingency table of your results.
joint_distribution_long = data.frame(
x = c(1, 1, 1, 2, 2, 2),
y = c(1, 2, 3, 1, 2, 3)
)
probabilities_long =
c(0.1, 0.2, 0.3, 0.2, 0.2, 0.0)
sampled_rows = sample(
x = 1:nrow(joint_distribution_long),
size = 10000,
replace = TRUE,
prob = probabilities_long
)
draws = joint_distribution_long[sampled_rows, ]
table(draws)
## y
## x 1 2 3
## 1 962 2027 3047
## 2 1945 2019 0
Comments (Math): The main mathematical idea is to think of each pair of possible values of \(X\) and \(Y\) as a separate “object”, put all these objects in a “bag”, and then draw from the bag. In other words, we convert the bivariate distribution from the problem to the following univariate distribution
| (1,1) | (1,2) | (1,3) | (2,1) | (2,2) | (2,3) |
|---|---|---|---|---|---|
| 0.1 | 0.2 | 0.3 | 0.2 | 0.2 | 0 |
and sample from it. When you do, you will get a vector whose elements are pairs of numbers. The last step is to extract the components of those pairs into separate vectors.
Comments (R): The most important new R concept here is
data.frame. You should think of it as a spreadsheet. It is,
mathematically, a matrix, but we do not perform any mathematical
operations on it. Moreover, not all columns in the data frame have to be
numeric. Some of them can be strings, and others can be something even
more exotic. You should think of a data frame as a bunch of column
vectors of the same length stacked side by side. It is important to note
that each column of a data frame will have a name, so that we don’t have
to access it by its position only (as we would have to in the case of a
matrix).
In this class, the column vectors of data frames are going to contain simulated values. In statistics, it is data that comes in data frames, with rows corresponding to different observations, and columns to various observed variables.
The easiest way to construct a data frame using already existing vectors is as follows:
x = c(1, 2, 3)
y = c("a", "b", "c")
(df = data.frame(x, y))
## x y
## 1 1 a
## 2 2 b
## 3 3 c
Note that the two columns inherited their names from the vectors
x and y that fed into them. Note, also, that
all rows got consecutive numerical values as names by default. Row names
are sometimes useful to have, but are in general a nuisance and should
be ignored (especially in this class). Column names are more important,
and there is a special notation (the dollar-sign notation) that allows
you to access a column by its name:
df$y
## [1] "a" "b" "c"
If you want to give your columns custom names (or if you are building them out of explicitly given vectors as in the solution above) use the following syntax
z = c("a", "b", "c", "d")
(df = data.frame(letters = z, numbers = c(1, 2, 3, 4)))
## letters numbers
## 1 a 1
## 2 b 2
## 3 c 3
## 4 d 4
A feature that data frames share with vectors and matrices is that
you can use vector indexing as in the following example (where
df is as above)
df[c(2, 4, 4, 1), ]
## letters numbers
## 2 b 2
## 4 d 4
## 4.1 d 4
## 1 a 1
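Make sure you understand why the expression inside the brackets is
c(2,4,4,1), (with the trailing comma) and not just c(2,4,4,1): the comma separates row indices from column indices, and leaving the part after it empty keeps all columns. R’s desire to
keep row names unique leads to some cumbersome constructs such as
4.1 above. As I mentioned before, just disregard them.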
A nice thing about data frames is that they can easily be pretty-printed in RStudio. Go to the Environment tab in one of your RStudio panes, and double click on the name of the data frame you just built. It will appear as a nicely formatted spreadsheet.
Once we have the data frame containing all \(6\) pairs of possible values \(X\) and \(Y\) can take (called
joint_distribution_long in the solution above), we can
proceed by sampling from its rows, by sampling from the set
1,2,3,4,5,6 with probabilities
0.1, 0.2, 0.3, 0.2, 0.2, 0.0. The result of the
corresponding sample command will be a sequence - called
sampled_rows in the solution - of length \(10,000\) composed of numbers \(1,2,3,4,5\) or \(6\). The reason we chose the name
sampled_rows is because each number corresponds to a row
from the data frame joint_distribution_long, and by
indexing joint_distribution_long by
sampled_rows we are effectively sampling from its rows. In
other words, the command
joint_distribution_long[sampled_rows, ] turns a bunch of
numbers into a bunch of rows (many of them repeated) of the data frame
joint_distribution_long.
The final step is to use the function table. This time,
we are applying it to a data frame and not to a vector, but the effect
is the same. It tabulates all possible combinations of values of the
columns, and counts how many times each of them happened. The same
result would have been obtained by calling
table(draws$x, draws$y).
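To compare the simulation with the table given in the problem, it helps to convert counts into relative frequencies. The sketch below repeats the simulation with shorter, purely illustrative variable names (support, probs, rows) and then applies prop.table, which divides each entry of a table by the total count.

```r
set.seed(1)  # arbitrary seed, only to make the run reproducible
support = data.frame(x = c(1, 1, 1, 2, 2, 2), y = c(1, 2, 3, 1, 2, 3))
probs = c(0.1, 0.2, 0.3, 0.2, 0.2, 0.0)
rows = sample(1:nrow(support), size = 10000, replace = TRUE, prob = probs)
draws = support[rows, ]
# relative frequencies instead of counts; compare with the table in the problem
rel_freq = prop.table(table(draws$x, draws$y))
round(rel_freq, 3)
# (estimated) marginal distribution of X: sum each row of the table
rowSums(rel_freq)
```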
Use Monte Carlo to estimate the expected value of the exponential random variable with parameter \(\lambda= 4\) using \(n=10\), \(n=1,000\) and \(n=1,000,000\) simulations. Compare to the exact value.
x = rexp(10, rate = 4)
mean(x)
## [1] 0.1779768
For an exponential random variable with parameter \(\lambda\), the expected value is \(1/\lambda\) (such information can be found in Appendix A) which, in this case, is \(0.25\). The error made was 0.072023 for \(n=10\) simulations.
We increase the number of simulations to \(n=1000\) and get a better result
x = rexp(1000, rate = 4)
mean(x)
## [1] 0.2564643
with (smaller) error -0.0064643. Finally, let’s try \(n=1,000,000\):
x = rexp(1000000, rate = 4)
mean(x)
## [1] 0.250381
The error is even smaller: -0.00038101.
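For completeness, the exact value \(1/\lambda\) used above can be recovered from the density \(f(x) = \lambda e^{-\lambda x}\), \(x\geq 0\), of the exponential distribution by a single integration by parts: \[ {\mathbb{E}}[X] = \int_0^{\infty} x\, \lambda e^{-\lambda x}\, dx = \Big[ -x e^{-\lambda x} \Big]_0^{\infty} + \int_0^{\infty} e^{-\lambda x}\, dx = 0 + \tfrac{1}{\lambda} = \tfrac{1}{\lambda},\] which equals \(0.25\) for \(\lambda=4\).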
Comments (R): The only new thing here is the command
mean which computes the mean of a vector.
Comments (Math): There is a lot going on here conceptually. This is the first time we used the Monte Carlo method. It is an incredibly useful tool, as you will keep being reminded throughout this class. The idea behind it is simple, and it is based on the Law of large numbers:
Theorem. Let \(X_1,X_2,\dots\) be an independent sequence of random variables with the
same distribution, for which the expected value can be computed. Then
\[ \tfrac{1}{n} \Big( X_1+X_2+\dots+X_n\Big) \to {\mathbb{E}}[X_1] \text{ as } n\to\infty.\] The idea behind
Monte Carlo is to turn this theorem “upside down”. The goal is to
compute \({\mathbb{E}}[X_1]\) and use a
supply of random numbers, each of which comes from the same
distribution, to accomplish that. The random number generator inside
rexp gives us a supply of numbers (stored in the vector
x) and all we have to do is compute their average. This
gives us the left-hand side of the formula above, and, if \(n\) is large enough, we hope that this
average does not differ too much from its theoretical limit. As \(n\) gets larger, we expect better and
better results. That is why your error above gets smaller as \(n\) increases.
It looks like Monte Carlo can only be used to compute the expected value of a random variable, which does not seem like such a big deal. But it is! You will see in the sequel that almost anything can be written as the expected value of some random variable.
Use Monte Carlo to estimate \({\mathbb{E}}[X^2]\), where \(X\) is a standard normal random variable.
You may or may not know that when \(X\) is standard normal \(Y=X^2\) has a \(\chi^2\) distribution with one degree of freedom. If you do, you can solve the problem like this:
y = rchisq(5000, df = 1)
mean(y)
## [1] 0.9771929
If you don’t, you can do the following:
x = rnorm(5000)
y = x^2
mean(y)
## [1] 1.019852
Comments (Math+R): We are asked to compute \({\mathbb{E}}[ X^2]\), which can be
interpreted in two ways. First, we can think of \(Y=X^2\) as a random variable in its own
right and try to take draws from the distribution of \(Y\). In the case of the normal
distribution, the distribution of \(Y\)
is known - it happens to be a \(\chi^2\)-distribution with a single degree
of freedom (don’t worry if you never heard of it). We can simulate it in
R by using its R name chisq and get a number close to the
exact value of \(1\).
If you did not know about the \(\chi^2\) distribution, you would not know
what R name to put the prefix r in front of. What makes the
simulation possible is the fact that \(Y\) is a transformation of a
random variable we know how to simulate. In that case, we simply
simulate the required number of draws x from the normal
distribution (using rnorm) and then apply the
transformation \(x \mapsto x^2\) to the
result. The transformed vector y is then nothing but the
sequence of draws from the distribution of \(X^2\).
The idea described above is one of the main advantages of the Monte Carlo technique: if you know how to simulate a random variable, you also know how to simulate any (deterministic) function of it. That fact will come into its own a bit later when we start working with several random variables and stochastic processes, but it can be very helpful even in the case of a single random variable, as you will see in the next problem.
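The same recipe works for any other function of \(X\). As a sketch, here is a Monte Carlo estimate of \({\mathbb{E}}[e^X]\) for a standard normal \(X\). The function \(e^x\), the sample size and the seed are illustrative choices that are not part of the problem; the exact value \(e^{1/2}\) is the mean of the standard lognormal distribution.

```r
set.seed(1)       # arbitrary seed, only to make the run reproducible
x = rnorm(100000)
y = exp(x)        # apply the transformation x -> e^x to every draw
est = mean(y)
c(estimate = est, exact = exp(0.5))
```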
Let \(X\) be a standard normal random variable. Use Monte Carlo to estimate the probability \({\mathbb{P}}[ X > 1 ]\). Compare to the exact value.
The estimated probability:
x = rnorm(10000)
y = x > 1
(p_est = mean(y))
## [1] 0.1608
The exact probability and the error
p_true = 1 - pnorm(1)
(err = p_est - p_true)
## [1] 0.002144746
Comments (R): As we learned before, the symbol
> is an operation, which returns a Boolean
(TRUE or FALSE) value. For example:
1 > 2
## [1] FALSE
5^2 > 20
## [1] TRUE
It is vectorized:
x = c(1, 2, 4)
y = c(5, -4, 3)
x > y
## [1] FALSE TRUE TRUE
and recycling rules apply to it (so that you can compare a vector and a scalar, for example)
x = 1:10
x > 5
## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
Therefore, the vector y in the solution is a vector of
length \(10000\) whose elements are
either TRUE or FALSE; here are the first 5
rows of data frame with columns x and y from
our solution:
| x | y |
|---|---|
| 1.9493 | TRUE |
| -1.1015 | FALSE |
| 1.0448 | TRUE |
| -0.1384 | FALSE |
| -0.2573 | FALSE |
Finally, p_est contains the mean of
y. How do you compute a mean of Boolean values? In R (and
many other languages) TRUE and FALSE have
default numerical values, usually \(1\)
and \(0\). This way, when R is asked to compute the
sum of a Boolean vector it will effectively count the
number of values which are TRUE. Similarly, the
mean is the relative proportion of TRUE
values.
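A tiny example, on a made-up Boolean vector, shows the conversion at work:

```r
b = c(TRUE, FALSE, TRUE, TRUE)
as.numeric(b)  # 1 0 1 1
sum(b)         # 3, the number of TRUE values
mean(b)        # 0.75, the proportion of TRUE values
```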
Comments (Math): We computed the proportion of the “times” \(X>1\) (among many simulations of \(X\)) and used it to approximate the probability \({\mathbb{P}}[ X>1]\). More formally, we started from a random variable \(X\) with a normal distribution and then transformed it into another random variable, \(Y\), by setting \(Y=1\) whenever \(X>1\) and \(0\) otherwise. This is often written as follows \[ Y = \begin{cases} 1, & X>1 \\ 0, & X\leq 1.\end{cases}\] The random variable \(Y\) is very special - it can only take values \(0\) and \(1\) (i.e., its support is \(\{0,1\}\)). Such random variables are called indicator random variables, and their distribution, called the Bernoulli distribution, always looks like this:
| 0 | 1 |
|---|---|
| 1-p | p |
for some \(p \in [0,1]\). The parameter \(p\) is nothing but the probability \({\mathbb{P}}[Y=1]\).
So why did we decide to transform \(X\) into \(Y\)? Because of the following simple fact: \[ {\mathbb{E}}[ Y] = 1 \times p + 0 \times (1-p) = p.\] The expected value of an indicator is the probability \(p\), and we know that we can use Monte Carlo whenever we can express the quantity we are computing as an expected value of a random variable we know how to simulate.
Many times, simulating a random variable is easier than analyzing it mathematically. Here is a fun example:
Use Monte Carlo to estimate the value of \(\pi\) and compute the error.
nsim = 1000000
x = runif(nsim,-1, 1)
y = runif(nsim,-1, 1)
z = (x ^ 2 + y ^ 2) < 1
(pi_est = 4 * mean(z))
## [1] 3.141728
(err = pi_est - pi)
## [1] 0.0001353464
Comments (Math):
As we learned in the previous problem, probabilities of events can be
computed using Monte Carlo, as long as we know how to simulate the
underlying indicator random variable. In this case, we want to compute
\(\pi\), so we would need to find a
“situation” in which the probability of something is \(\pi\). Of course, \(\pi>1\), so it cannot be a probability
of anything, but \(\pi/4\) can, and
computing \(\pi/4\) is as useful as
computing \(\pi\). To create the
required probabilistic “situation” we think of the geometric meaning of
\(\pi\), and come up with the following
scheme. Let \(X\) and \(Y\) be two independent uniform random
variables each with values between \(-1\) and \(1\). We can think of the pair \((X,Y)\) as a random point in the square
\([-1,1]\times [-1,1]\). This point
will sometimes fall inside the unit circle, and sometimes it will not.
What is the probability of hitting the circle? Well, since \((X,Y)\) is uniformly distributed everywhere
inside the square, this probability should be equal to the portion of
the area of our square which belongs to the unit circle. The area of the
square is \(4\) and the area of the
circle is \(\pi\), so the required
probability is \(\pi/4\). Using the
idea from the previous problem, we define the indicator random variable
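\(Z\) as follows \[ Z = \begin{cases} 1, & (X,Y) \text{ is inside the unit circle, } \\ 0, & \text{ otherwise} \end{cases} = \begin{cases} 1, & X^2+Y^2 < 1, \\ 0, & \text{ otherwise.} \end{cases}\] Since \({\mathbb{E}}[Z] = {\mathbb{P}}[Z=1] = \pi/4\), the Monte Carlo estimate of \(\pi\) is \(4\) times the average of the simulated values of \(Z\), which is exactly what the code above computes.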
1. Write an R function cumavg which computes the sequence of running averages of a vector, i.e., if the input is \(x=(x_1,x_2,x_3,\dots, x_n)\), the output should be \[ \Big(x_1, \frac{1}{2} (x_1+x_2), \frac{1}{3}(x_1+x_2+x_3), \dots, \frac{1}{n} (x_1+x_2+\dots+x_n)\Big).\] Test it to check that it really works. (Hint: look up the function cumsum.)
2. Apply cumavg to the vector \(4 z\) from the previous problem and plot your results (use a smaller value for nsim, maybe \(1000\)). Plot the values against their index. Add a red horizontal line at the level \(\pi\). Rerun the same code (including the simulation part) several times.
With cumsum, the problem becomes much easier.
cumavg = function(x) {
  s = cumsum(x)       # running sums x_1, x_1 + x_2, ...
  n = seq_along(x)    # 1, 2, ..., length(x)
  return(s / n)
}
x = c(1, 3, 5, 3, 3, 9)
cumavg(x)
## [1] 1 2 3 3 3 4
nsim = 1000
x = runif(nsim,-1, 1)
y = runif(nsim,-1, 1)
z = (x ^ 2 + y ^ 2) < 1
pi_est = cumavg(4 * z)
plot(
1:nsim,
pi_est,
type = "l",
xlab = "number of simulations",
ylab = "estimate of pi",
main = "Computing pi by Monte Carlo"
)
abline(pi, 0,
col = "red")
Comments (R): This course is not about R graphics, but I
think it is a good idea to teach you how to make basic plots in R. We
already used the function plot to draw scatterplots. By
default, each point drawn by plot is marked by a small
circle so it might not seem like a good idea to use it. Luckily, this,
and many other things, can be adjusted by numerous additional arguments.
One such argument is type, which determines the type of
the plot. We used type="l" which tells R to join the points
with straight lines:
x = c(1, 3, 4, 7)
y = c(2, 1, 5, 5)
plot(x, y, type = "l")
The other arguments,
xlab, ylab and
main, determine labels for axes and the entire plot. The
function abline(a,b) adds a line with intercept a and slope b, i.e., the line \(y = a + b x\), to an already existing plot; this is why abline(pi, 0) above draws a horizontal line at the level \(\pi\).
It is very useful in statistics if one wants to show the regression line
superimposed on the scatterplot of data. Finally, the argument
col, of course, determines the color of the line. To learn
about various graphical parameters, type ?par.
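Here is a short sketch of a few ways abline can be called, drawn on top of the small line plot from above. The particular levels, colors and line type are arbitrary; h, v, a, b, col and lty are standard arguments of abline.

```r
x = c(1, 3, 4, 7)
y = c(2, 1, 5, 5)
plot(x, y, type = "l", xlab = "x", ylab = "y", main = "abline examples")
abline(h = 3, col = "red")        # horizontal line at height 3
abline(v = 4, col = "blue")       # vertical line at x = 4
abline(a = 1, b = 0.5, lty = 2)   # dashed line with intercept 1 and slope 0.5
```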
Comments (Math): The conceptual reason for this exercise is to explore (numerically) the kinds of errors we make when we use Monte Carlo. Unlike the deterministic numerical procedures, Monte Carlo has a strange property that no bound on the error can be made with absolute certainty. Let me give you an example. Suppose that you have a biased coin, with the probability \(0.6\) of heads and \(0.4\) of tails. You don’t know this probability, and use a Monte Carlo technique to estimate it - you toss your coin \(1000\) times and record the number of times you observe \(H\). The law of large numbers suggests that the relative frequency of heads is close to the true probability of \(H\). Indeed, you run a simulation
x = sample(c("T", "H"), 1000, prob = c(0.4, 0.6), replace = TRUE)
y = x == "H"
mean(y)
## [1] 0.594
and get a pretty accurate estimate of \(0.594\). If you run the same code a few more times, you will get different estimates, but all of them will be close to \(0.6\). Theoretically, however, your simulation could have yielded \(1000\) Hs, which would lead you to report \(p=1\) as the Monte-Carlo estimate. The point is that even though such disasters are theoretically possible, they are exceedingly unlikely. The probability of getting all \(H\) in \(1000\) tosses of this coin is \(0.6^{1000}\), a number with more than \(200\) zeros after the decimal point.
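The size of this probability is easy to get a feeling for by computing its base-\(10\) logarithm, which tells us (roughly) how many zeros follow the decimal point:

```r
# log10 of P[1000 heads in a row] = 1000 * log10(0.6)
log10_prob = 1000 * log10(0.6)
log10_prob      # about -221.8
# the probability itself is tiny, but still representable as a double
0.6^1000
```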
The take-home message is that even though there are no guarantees,
Monte Carlo performs well the vast majority of the time. The crucial
ingredient, however, is the number of simulations. The plot you were
asked to make illustrates exactly that. The function cumavg
gives you a whole sequence of Monte-Carlo estimates of the same thing
(the number \(\pi\)) with different
numbers of simulations nsim. For small values of
nsim the error is typically very large (and very random).
As the number of simulations grows, the situation stabilizes and the
error decreases. Without going into the theory behind it, let me only
mention that in the majority of practical applications we have the
following relationship: \[ \text{error} \sim \frac{1}{\sqrt{n}}.\] In words, if you want to double the
precision, you need to quadruple the number of simulations. If you want
an extra digit in your estimate, you need to multiply the number of
simulations by \(100\). Here is an
image where I superimposed \(40\) plots
like the one you were asked to produce (the dashed lines are \(\pm \frac{4}{\sqrt{n}}\)):
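The \(1/\sqrt{n}\) rule can be probed numerically by rerunning the whole estimation many times for two different values of \(n\) and comparing the typical sizes of the errors. In the sketch below the helper pi_estimate, the number of reruns (\(500\)) and the seed are illustrative assumptions; the spread is measured by the standard deviation of the estimates.

```r
set.seed(1)  # arbitrary seed, only to make the run reproducible
# one Monte Carlo estimate of pi based on n random points in the square
pi_estimate = function(n) {
  x = runif(n, -1, 1)
  y = runif(n, -1, 1)
  4 * mean(x^2 + y^2 < 1)
}
# spread of the estimate over 500 independent reruns, for n and for 4n
spread_n  = sd(replicate(500, pi_estimate(1000)))
spread_4n = sd(replicate(500, pi_estimate(4000)))
spread_n / spread_4n   # should be in the vicinity of 2
```

Quadrupling the number of simulations roughly halves the spread, in line with the relationship above.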
Let \(X\) and \(Y\) be two independent geometric random variables with parameter \(p=0.5\), and let \(Z=X+Y\). Compute \({\mathbb{P}}[ X = 3| Z = 5]\) using simulation. Compare to the exact value.
By simulation:
nsim = 1000000
X = rgeom(nsim, prob = 0.5)
Y = rgeom(nsim, prob = 0.5)
Z = X + Y
X_cond = X[Z == 5]
mean(X_cond == 3)
## [1] 0.1684758
To get the exact value, we start from the definition: \[ {\mathbb{P}}[ X = 3 | Z= 5 ] = \frac{{\mathbb{P}}[ X=3 \text{ and }Z=5]}{{\mathbb{P}}[Z=5]} = \frac{{\mathbb{P}}[X=3 \text{ and }Y = 2]}{{\mathbb{P}}[Z=5]}, \] where the last equality follows from the fact that \(\{ X=3 \text{ and } Z=5 \}\) is exactly the same event as \(\{ X = 3 \text{ and } Y=2\}\). Since \(X\) and \(Y\) are independent, we have \[{\mathbb{P}}[ X=3 \text{ and }Y=2 ] = {\mathbb{P}}[X=3] \times {\mathbb{P}}[ Y=2] = 2^{-4} 2^{-3} = 2^{-7}.\] To compute \({\mathbb{P}}[ Z = 5]\) we need to split the event \(\{ Z = 5 \}\) into events we know how to deal with. Since \(Z\) is built from \(X\) and \(Y\), we write \[ \begin{align} {\mathbb{P}}[ Z = 5 ] = &{\mathbb{P}}[X=0 \text{ and }Y=5]+ {\mathbb{P}}[ X=1 \text{ and }Y=4] + {\mathbb{P}}[ X=2 \text{ and }Y=3] + \\ & {\mathbb{P}}[ X=3 \text{ and }Y=2] + {\mathbb{P}}[ X=4 \text{ and }Y=1] + {\mathbb{P}}[ X = 5 \text{ and }Y=0]. \end{align}\] Each of the individual probabilities in the sum above is \(2^{-7}\), so \({\mathbb{P}}[ X = 3 | Z = 5] = \frac{1}{6}\). This gives us an error of 0.0018091.
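The same computation can be double-checked in R with the pmf dgeom of the geometric distribution, which uses the same convention as rgeom (values \(0,1,2,\dots\), counting failures before the first success). The names numerator, denominator and p_exact are only illustrative.

```r
# P[X = 3 and Y = 2], using independence
numerator = dgeom(3, prob = 0.5) * dgeom(2, prob = 0.5)
# P[Z = 5]: sum over all ways to split 5 as x + (5 - x), x = 0, ..., 5
denominator = sum(dgeom(0:5, prob = 0.5) * dgeom(5:0, prob = 0.5))
(p_exact = numerator / denominator)
```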
Comments (Math): Let us, first, recall what the conditional probability is. The definition we learn in the probability class is the following \[ {\mathbb{P}}[A | B] = \frac{{\mathbb{P}}[A \text{ and }B]}{{\mathbb{P}}[B]},\] as long as \({\mathbb{P}}[B]>0\). The interpretation is that \({\mathbb{P}}[A|B]\) is still the probability of \(A\), but now in the world where \(B\) is guaranteed to happen. Conditioning usually happens when we receive new information. If someone tells us that \(B\) happened, we can disregard everything in the complement of \(B\) and adjust our probability to account for that fact. First we remove from \(A\) anything that belongs to the complement of \(B\), and recompute the probability \({\mathbb{P}}[A \cap B]\). We also have to divide by \({\mathbb{P}}[B]\) because we want the total probability to be equal to \(1\).
Our code starts as usual, by simulating \(X\) and \(Y\) from the required distribution, and
constructing a new vector \(Z\) as
their sum. The variable X_cond is new; we build it from
\(X\) by removing all the elements
whose corresponding \(Z\) is
not equal to \(5\). This is an
example of what is sometimes called the rejection
method in simulation. We simply “reject” all simulations which
do not satisfy the condition we are conditioning on. We can think of
X_cond as a bunch of simulations of \(X\), but in the world where \(Z=5\) is guaranteed to happen. Once we have
X_cond, we proceed as usual by computing the relative
frequency of the value \(3\) among all
possible values \(X\) can take. Note
that the same X_cond can also be used to compute the
conditional probability \({\mathbb{P}}[ X=1|
Z=5]\). In fact, X_cond contains the information
about the entire conditional distribution of \(X\) given \(Z=5\); if we draw a histogram of
X_cond, we will get a good idea of what this distribution
looks like:
Since X_cond contains only discrete values from \(0\) to \(5\), a contingency table might be a better
tool for understanding its distribution:
| 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 7745 | 7761 | 7691 | 7807 | 7731 | 7604 |
The histogram and the table above suggest that the distribution of \(X\), given \(Z=5\), is uniform on \(\{0,1,2,3,4,5\}\). It is - a calculation almost identical to the one we performed above gives that \({\mathbb{P}}[ X= i| Z=5] = \frac{1}{6}\) for each \(i=0,1,2,3,4,5\).
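The code that produced the table is not shown above. A minimal sketch, assuming the same setup as in the solution and with a seed added for reproducibility, could look like this (your counts will differ from the ones in the table):

```r
set.seed(2024)
nsim = 1000000
X = rgeom(nsim, prob = 0.5)
Y = rgeom(nsim, prob = 0.5)

# rejection step: keep only the draws with Z = X + Y equal to 5
X_cond = X[X + Y == 5]

counts = table(X_cond)          # contingency table of the retained draws
rel_freq = prop.table(counts)   # relative frequencies, each should be near 1/6
counts
round(rel_freq, 4)
```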
One more observation at the end. Note that we drew \(n=1,000,000\) simulations this time. While
it is probably an overkill for this particular example, conditional
probabilities in general require more simulations than unconditional
ones. Of course, that is because we reject most of our original draws.
Indeed, the size of the vector X_cond is 46339 - more than
a \(20\)-fold decrease. This fact
becomes particularly apparent when we try to use Monte Carlo for
conditional distributions associated with continuous random
vectors, as we will see in our next problem.
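To get a feeling for how many draws survive the rejection step before running anything, one can compute the expected number of accepted draws, \(n \times {\mathbb{P}}[Z=5]\). A small sketch, using the value \({\mathbb{P}}[Z=5] = 6 \times 2^{-7}\) derived in the solution:

```r
nsim = 1000000

# P[Z = 5], computed as in the exact solution
p_accept = sum(dgeom(0:5, prob = 0.5) * dgeom(5:0, prob = 0.5))

expected_accepted = nsim * p_accept
expected_accepted
```

This gives \(46875\), in good agreement with the \(46339\) draws that were actually retained.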
Let \(X\) and \(Y\) be independent random variables where \(X\) has the \(N(0,1)\) distribution and \(Y\) the exponential distribution with parameter \(\lambda=1\). Find a graphical approximation to the conditional density of \(Y\), given \(X+Y\geq 1\). Repeat the same, but condition on \(X+Y=1\).
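nsim = 100000
x = rnorm(nsim)
y = rexp(nsim)
# keep only the draws that satisfy the condition X + Y >= 1
cond = x + y >= 1
# the problem asks for the conditional density of Y, so we subset y
y_cond = y[cond]
hist(y_cond, breaks = 100)
nsim = 100000
eps = 0.1
x = rnorm(nsim)
y = rexp(nsim)
# relax the zero-probability condition X + Y = 1 to 1 - eps < X + Y < 1 + eps
cond = (1 - eps < x + y) & (x + y < 1 + eps)
y_cond = y[cond]
hist(y_cond, breaks = 100)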
Comments (Math): In the case of conditioning on \(X+Y\geq 1\) we repeated the same procedure as in the discrete case. We simply rejected all draws that do not satisfy the condition.
When conditioning on \(X+Y=1\),
however, you immediately encounter a problem that you don’t get with
discrete distributions. The event \(\{
X+Y=1\}\) has probability \(0\)
and will never happen. That means that our strategy from the previous
problem will simply not work - you will reject all
draws. The problem goes beyond a particular approach to the problem, as
the conditional probabilities such as \({\mathbb{P}}[ Y \geq 0 | X+Y=1]\) are not
well defined. Indeed, the formula \[
{\mathbb{P}}[ Y \geq 0 | X+Y=1] "=" \frac{{\mathbb{P}}[ Y\geq
0 \text{ and } X+Y=1]}{ {\mathbb{P}}[X+Y=1]}\] requires that the
probability in the denominator be strictly positive. Otherwise you are
dividing by zero. The theoretical solution to this is by no means simple
and requires mathematics beyond the scope of these notes. Practically,
there is a very simple way of going around it. Instead of conditioning
on the zero-probability event \(X+Y=1\), we use a slightly more relaxed
condition \[ X+Y \in (1-\varepsilon,
1+\varepsilon) \] for a small, but positive, \(\varepsilon\). In many cases of interest,
this approximation works very well, as long as \(\varepsilon\) is not too big. How big?
Well, that will depend on the particular problem, as well as on the
number of simulations you are drawing. The best way is to try several
values and experiment. For example, if we chose \(\varepsilon=0.01\) in our problem, the
number of elements in the conditioned vector (i.e., the number of
non-rejected draws) would be on the order of \(100\), which may be considered too small to
produce an accurate histogram. On the other hand, when \(\varepsilon=1\), your result will be
inaccurate because you are conditioning on the event \(0 < X+Y < 2\) which is a poor
approximation for \(X+Y=1\). The rule
of thumb is to take the smallest \(\varepsilon\) you can, while keeping the
number of non-rejected draws sufficiently large.
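The trade-off is easy to explore numerically. The sketch below (the particular grid of \(\varepsilon\) values is my choice, not something prescribed by the problem) counts how many draws survive the rejection step for several values of \(\varepsilon\):

```r
set.seed(2)
nsim = 100000
x = rnorm(nsim)
y = rexp(nsim)
s = x + y

# candidate values of epsilon, from very strict to very relaxed
eps_values = c(0.01, 0.05, 0.1, 0.5, 1)

# number of non-rejected draws, i.e., draws with 1 - eps < X + Y < 1 + eps
accepted = sapply(eps_values, function(eps) sum(abs(s - 1) < eps))

data.frame(eps = eps_values, accepted = accepted)
```

Small values of \(\varepsilon\) leave few draws for the histogram, while large ones keep many draws that are far from satisfying \(X+Y=1\).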
Find the Weibull distribution in R’s help system. Simulate \(n=10000\) draws from the Weibull distribution with shape parameter \(2\) and scale parameter \(3\). Draw a histogram of your simulations.
Suppose that the vector x contains \(n=10000\) simulations from the standard
normal (\(\mu=0, \sigma=1\)). Without
simulating any new random numbers, transform it into the vector
y such that y is a vector of \(n=10000\) simulations from the normal with
\(\mu=1\) and \(\sigma=0.5\). Draw histograms of both
x and y on the same plot. (Note: the
extra parameter add is used to superimpose plots. You may
want to use different colors, too. Use the parameter col
for that. )
Starting with x=seq(-3,3,by=0.1), define the
appropriate vector y and use x and
y to plot the graph of the cdf of the standard normal. The
command you want to use is plot with the following extra
arguments
type="l" (to get a smooth line instead of a bunch of points),
main="The CDF of the standard normal" (to set the title), and
ylab="F(x)" (to label the \(y\)-axis).
The R name for the Weibull distribution is weibull
and the argument names corresponding to the shape and scale parameters
are shape and scale:
x = rweibull(10000, shape = 2, scale = 3)
hist(x)
Let \(X\) be a normally distributed random variable, with parameters \(\mu_X\) and \(\sigma_X\). When we apply a linear transformation \(Y = \alpha X + \beta\) to \(X\), the result \(Y\) has a normal distribution again, but with different parameters. These parameters, call them \(\mu_Y\) and \(\sigma_Y\), are easily identified by taking the expected value and the variance:
\[\begin{align} \mu_Y & = {\mathbb{E}}[Y] = \alpha {\mathbb{E}}[X] + \beta = \alpha \mu_X + \beta \\ \sigma_Y^2 & = \operatorname{Var}[Y] = \operatorname{Var}[\alpha X + \beta] = \alpha^2 \operatorname{Var}[X] = \alpha^2 \sigma_X^2 \end{align}\]
In the problem we are given \(\mu_X=0\) and \(\sigma_X=1\), so we must take \(\alpha = 0.5\) and \(\beta=1\) to get \(\mu_Y=1\) and \(\sigma_Y=0.5\) (note that this is exactly the opposite of taking \(z\)-scores, where we transform a general normal into the standard normal). In R:
x = rnorm(10000)
y = 0.5 * x + 1
Let’s check that the parameters of y are as
required:
(mean(y))
## [1] 1.002488
(sd(y))
## [1] 0.4989339
x = seq(-3, 3, by = 0.1)
y = pnorm(x)
plot(x, y, type = "l", ylab = "F(x)", main = "The CDF of the standard normal")
Simulate \(n=1000\) draws from the distribution whose distribution table is given by
| 2 | 4 | 8 | 16 |
|---|---|---|---|
| 0.2 | 0.3 | 0.1 | 0.4 |
Draw a histogram of your results.
You may have learned in probability how to compute the pdf \(f_Y(y)\) of a transformation \(Y=g(X)\) of a random variable with pdf \(f_X(x)\). Suppose that you forgot how to do that, but have access to \(10,000\) simulations from the distribution of \(X\). How would you get an approximate idea about the shape of the function \(f_Y\)?
More concretely, take \(X\) to be exponentially distributed with parameter \(1\) and \(g(x) = \sin(x)\) and produce a picture that approximates the pdf \(f_Y\) of \(Y\). (Note: even if you remember how to do this analytically, you will run into a difficulty. The function \(\sin(x)\) is not one-to-one and the method usually taught in probability classes will not apply. If you learned how to do it in the many-to-one case of \(g(x)= \sin(x)\), kudos to your instructor!)
Let \(X\) be a random variable with the Cauchy distribution, and \(Y = \operatorname{arctan}(X)\). R allows you to simulate from the Cauchy distribution, even if you do not know what it is. How would you use that to make an educated guess as to what the distribution of \(Y\) is? To make your life easier, consider \(\tfrac{2}{\pi} Y\) first.
x = sample(c(2, 4, 8, 16), size = 10000, prob = c(0.2, 0.3, 0.1, 0.4), replace = TRUE)
hist(x)
Note: given that we are dealing with a discrete distribution, a contingency table might be a better choice:
| 2 | 4 | 8 | 16 |
|---|---|---|---|
| 2044 | 2945 | 1045 | 3966 |
We apply the function \(\sin\) to the simulations. The histogram of the obtained values is going to be a good (graphical) approximation to the pdf of the transformed random variable:
x = rexp(100000)
y = sin(x)
hist(y)
Having learned that histograms look like the pdfs of the underlying distributions, we draw the histogram:
x = rcauchy(10000)
y = atan(x) * 2/pi
hist(y)
It looks uniform (if we replace \(10,000\) by \(100,000\) it will look even more uniform). We conclude that \(\tfrac{2}{\pi} \arctan(X)\) is probably uniformly distributed on \((-1,1)\). Hence, \(Y = \arctan(X)\) is probably uniformly distributed on \((-\pi/2, \pi/2)\).
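A histogram is a visual check. If you prefer a number, one possible sketch (my addition, not part of the original solution) is to compare the empirical cdf of the simulated values with the cdf of the uniform distribution on \((-1,1)\) and report the largest gap between the two:

```r
set.seed(4)
x = rcauchy(10000)
y = atan(x) * 2/pi

# grid of points at which the two cdfs are compared
grid = seq(-1, 1, by = 0.01)

# largest vertical distance between the empirical cdf of y and the U(-1,1) cdf
max_gap = max(abs(ecdf(y)(grid) - punif(grid, min = -1, max = 1)))
max_gap
```

A small value of max_gap is consistent with the conjecture. As with the histogram, it is numerical evidence and not a proof.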
A basic method for obtaining simulation draws from distributions
other than the uniform is the transformation method.
The idea is to start with (pseudo) random numbers, i.e., draws from the
uniform \(U(0,1)\) distribution, and
then apply a function \(g\) to each
simulation. The difficulty is, of course, how to choose the right
function \(g\).
Let \(X\) be a random variable with a continuous
and strictly increasing cdf \(F\). What
is the distribution of \(Y=F (X)\)?
What does that have to do with the transformation method?
(Hint: if you are having difficulty with this problem, feel free to run some experiments using R. )
Let us perform an experiment where \(X \sim
N(0,1)\). Remembering that the cdf is given by the R function
pnorm:
x = rnorm(100000)
y = pnorm(x)
hist(y)
This looks like a histogram of a uniform distribution on \((0,1)\). Let’s try with some other continuous distributions
x1 = rexp(100000)
x2 = rcauchy(100000)
x3 = runif(100000)
x4 = rgamma(100000, shape = 3)
par(mfrow = c(2, 2))
hist(pexp(x1))
hist(pcauchy(x2))
hist(punif(x3))
hist(pgamma(x4, shape = 3))
All of those point to the same conjecture, namely that \(F(X)\) is uniformly distributed on \((0,1)\). To prove that, we take \(Y=F(X)\) and try to compute the cdf \(F_Y\) of \(Y\): \[F_Y(y) = {\mathbb{P}}[ Y \leq y] = {\mathbb{P}}[ F(X) \leq y]\] Since \(F\) is strictly increasing, it admits an inverse \(F^{-1}\). Moreover, for any \(y \in (0,1)\), the set of all values of \(x\) such that \(F(x)\leq y\) (the red range) is exactly the interval \((-\infty, F^{-1}(y)]\) (the blue range), as in the picture below:
Hence, \[F_Y(y)={\mathbb{P}}[Y\leq y] = {\mathbb{P}}[ F(X) \leq y] = {\mathbb{P}}[ X \leq F^{-1}(y) ] = F(F^{-1}(y)) = y, \text{ for } y\in (0,1).\] The cdf \(F_Y\) is, therefore, equal to the cdf of a uniform on \((0,1)\). Since the cdf uniquely determines the distribution, \(Y\) must be uniformly distributed on \((0,1)\).
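As for the connection with the transformation method, one way to read the result backwards is this: if \(U\) is uniform on \((0,1)\), then \(F^{-1}(U)\) has cdf \(F\), because \({\mathbb{P}}[F^{-1}(U) \leq x] = {\mathbb{P}}[U \leq F(x)] = F(x)\). So \(g = F^{-1}\) is a natural candidate for the function \(g\). Here is a minimal sketch with the exponential distribution as an assumed example, where \(F(x) = 1-e^{-x}\) and \(F^{-1}(u) = -\log(1-u)\):

```r
set.seed(5)
u = runif(100000)      # (pseudo) random numbers, i.e., draws from U(0,1)
x = -log(1 - u)        # apply g = F^{-1} for the exponential with parameter 1

# the exponential with parameter 1 has mean 1 and variance 1
c(mean(x), var(x))

# compare the histogram with the exponential pdf
hist(x, probability = TRUE, breaks = 100)
curve(dexp(x), add = TRUE, col = "red", lwd = 2)
```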
Let \(f_1\) and \(f_2\) be two pdfs. We take a constant \(\alpha \in (0,1)\) and define the function
\(f\) by \[
f(x) = \alpha f_1(x) + (1-\alpha) f_2(x).\] The function \(f\) is the pdf of a third distribution,
which is called the mixture of \(f_1\) and \(f_2\) with weights \(\alpha\) and \(1-\alpha\). Assuming that you know
how to simulate from the distributions with pdfs \(f_1\) and \(f_2\), how would you draw \(10,000\) simulations from the mixture \(f\)? Show your method on the example of a
mixture of \(N(0,1)\) and \(N(4,1)\) with \(\alpha=2/3\). Plot the histogram of the
obtained sample (play with the parameter breaks until you
get a nice picture.)
(Hint: start with two vectors, the first containing \(10,000\) simulations from \(f_1\) and the second from \(f_2\). Then “toss” \(10,000\) biased coins with \(\mathbb{P}[ H ] = \alpha\) … )
The double exponential or Laplace distribution is a continuous probability distribution whose pdf is given by \[ \tfrac{1}{2} \exp(-|x|), x\in {\mathbb R}.\] This distribution is not built into R. How would you produce simulations from the double exponential using R?
The idea is that before each draw a biased coin (with \({\mathbb{P}}[H]=\alpha\)) is tossed. If
\(H\) is obtained, we draw from the
distribution with pdf \(f_1\).
Otherwise, we draw from the distribution with pdf \(f_2\). We write a function which performs
one such simulation, and then use the command replicate to
call it several times and store the results in the vector:
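single_draw = function() {
  # toss a biased coin: 1 with probability 2/3, 2 with probability 1/3
  coin = sample(c(1, 2), prob = c(2/3, 1/3), size = 1, replace = TRUE)
  if (coin == 1) {
    return(rnorm(1))  # draw from N(0,1)
  } else {
    return(rnorm(1, mean = 4, sd = 1))  # draw from N(4,1)
  }
}
nsim = 20000
y = replicate(nsim, single_draw())
hist(y)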
As you can see, the histogram has two “humps”, one centered around \(0\) and the other centered around \(4\). The first one is taller, which reflects the higher weight (\(\alpha=2/3\)) that \(N(0,1)\) has in this mixture.
If you wanted to write more succinct, vectorized code (which is not necessarily faster in this case), you could also do something like this:
nsim = 10000
alpha = 2/3
x1 = rnorm(nsim)
x2 = rnorm(nsim, mean = 4, sd = 1)
coin = sample(c(TRUE, FALSE), size = nsim, prob = c(alpha, 1 - alpha), replace = TRUE)
y = ifelse(coin, x1, x2)
The function ifelse is a vectorized version of the
if-then block and takes three arguments of equal length. The
first one is a vector of logical values c, and the other
two, x1, x2 only need to be of the same type. The result
is a vector whose value at the position i is
x1[i] if c[i]==TRUE and x2[i]
otherwise.
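A tiny example with made-up vectors (the names c_flags, a and b are only for illustration) shows the position-by-position behavior of ifelse:

```r
c_flags = c(TRUE, FALSE, TRUE, FALSE)   # plays the role of the logical vector c
a = c(1, 2, 3, 4)                       # values used where c_flags is TRUE
b = c(10, 20, 30, 40)                   # values used where c_flags is FALSE

result = ifelse(c_flags, a, b)
result   # 1 20 3 40
```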
The Laplace distribution can be understood as a mixture, with
\(\alpha=1/2\), of two distributions.
The first one is an exponential, and the second one is obtained from the first by
putting a minus sign in front of it.
Using our strategy from part 1. above, we could get simulations of it as
follows:
nsim = 100000
alpha = 1/2
x1 = rexp(nsim)
x2 = -rexp(nsim) # note the minus sign in front of rexp
coin = sample(c(TRUE, FALSE), size = nsim, prob = c(alpha, 1 - alpha), replace = TRUE)
y = ifelse(coin, x1, x2)
hist(y)
You can do this more efficiently if you realize that every time we
toss a coin and choose between x1 and x2, we
are really choosing the sign in front of an exponentially distributed
random variable. In other words, we can use coin as a
vector of random signs for a vector of draws from the exponential
distribution:
nsim = 10000
alpha = 1/2
x = rexp(nsim)
coin = sample(c(-1, 1), size = nsim, prob = c(alpha, 1 - alpha), replace = TRUE)
y = coin * x
hist(y)
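To check that the random-sign construction really produces the double exponential, one can compare the simulations with what the pdf \(\tfrac{1}{2}\exp(-|x|)\) predicts. The sketch below is my addition; it uses the fact that this distribution has mean \(0\) and variance \(2\), and superimposes the pdf on a density histogram:

```r
set.seed(7)
nsim = 100000

# random signs times exponential draws, as in the construction above
signs = sample(c(-1, 1), size = nsim, replace = TRUE)
y = signs * rexp(nsim)

# theoretical values are 0 (mean) and 2 (variance)
c(mean(y), var(y))

# density histogram with the Laplace pdf on top
hist(y, probability = TRUE, breaks = 100)
curve(0.5 * exp(-abs(x)), add = TRUE, col = "red", lwd = 2)
```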
Let x=rnorm(1000) and y=rnorm(1000). For
each of the following pairs, use the permutation test to decide whether
they are independent or not:
a) x^2+y^2 and y^2
b) (x+y)/sqrt(2) and (x-y)/sqrt(2)
c) x and 1
d) x^2+y^2 and atan(y/x)
(Note: do not worry about dividing by \(0\) in d). It will happen with probability \(0\).)
Let us start by writing a function to save some keystrokes
permutation_test = function(z, w) {
par(mfrow = c(2, 2))
plot(z, w, asp = 1)
plot(z, sample(w), asp = 1)
plot(z, sample(w), asp = 1)
plot(z, sample(w), asp = 1)
}
x = rnorm(1000)
y = rnorm(1000)
permutation_test(x^2 + y^2, y^2)
The first plot is very different from the other three. Therefore, the vectors are probably not independent.
permutation_test((x + y)/sqrt(2), (x - y)/sqrt(2))
The first plot could easily be confused for one of the other three. Therefore the vectors are probably independent.
# we have to use rep(1,length(x)) to get a vector of 1s of the same length as
# x. R will not recycle it properly if you simply write 1. Another, more
# 'hacky' way would be to take advantage of recycling and use 0*x+1
permutation_test(x, rep(1, length(x)))
The plots look very similar. Therefore, the vectors are probably independent. We could have known this without drawing any graphs. Anything is independent of a constant random variable (vector).
permutation_test(x^2 + y^2, atan(y/x))
Plots look very similar to each other. Therefore, z and
w are probably independent.
Note: The plots in b) and d) reveal that the distribution of the random vector \((X,Y)\) consisting of two independent standard normals is probably rotation invariant. In b) we are asked to compare the coordinates of the vector obtained from \((X,Y)\) by a rotation by \(45\) degrees around the origin. The fact that independence persisted suggests that components remain independent even after a (specific) rotation. If you tried rotations by different angles you would get the same result. The experiment in d) told us that the (squared) distance \(X^2+Y^2\) and the angle between \((X,Y)\) and the \(x\)-axis are independent. This is also something that one would expect from a rotationally-invariant distribution. Indeed, the distribution of the distance to the origin should not depend on the direction.
It is important to note that none of this proves anything. It is simply numerical evidence for a given conclusion.
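If comparing scatterplots by eye feels too subjective, the same permutation idea can be turned into a number. The sketch below is my own variation and not the method used above: it uses the correlation as a summary of each plot, which only detects linear dependence, and compares the observed value with values obtained after permuting one of the vectors. The names obs, perm and p_value are hypothetical.

```r
set.seed(8)
x = rnorm(1000)
y = rnorm(1000)
z = x^2 + y^2
w = y^2

obs = cor(z, w)                               # correlation in the original sample
perm = replicate(1000, cor(z, sample(w)))     # correlations after permuting w

# fraction of permuted samples whose correlation is at least as extreme
p_value = mean(abs(perm) >= abs(obs))
c(obs, p_value)
```

A very small p_value says that the original sample does not look like its permuted versions, which is numerical evidence against independence. A large value, on the other hand, does not establish independence, since dependent vectors can be uncorrelated.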
Simulate \(n=10000\) draws from the joint distribution given by the following table:
| | 1 | 2 | 3 |
|---|---|---|---|
| 1 | 0.1 | 0.0 | 0.3 |
| 2 | 0.1 | 0.1 | 0.0 |
| 3 | 0.0 | 0.0 | 0.4 |
Display the contingency table of your results, as well as a table showing the “errors”, i.e., differences between the theoretical frequencies (i.e., probabilities given above) and the obtained relative frequencies in the sample.
We are using the procedure from Section 2.2 in the notes.
nsim = 10000
joint_distribution_long = data.frame(
x = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
y = c(1, 2, 3, 1, 2, 3, 1, 2, 3)
)
probabilities_long =
c(0.1, 0.0, 0.3,
0.1, 0.1, 0.0,
0.0, 0.0, 0.4)
sampled_rows = sample(
x = 1:nrow(joint_distribution_long),
size = nsim,
replace = TRUE,
prob = probabilities_long
)
draws = joint_distribution_long[sampled_rows,]
(rel_freq = prop.table(table(draws)))
## y
## x 1 2 3
## 1 0.0998 0.0000 0.3017
## 2 0.0993 0.1005 0.0000
## 3 0.0000 0.0000 0.3987
(th_freq = matrix(probabilities_long, byrow = TRUE, nrow = 3))
## [,1] [,2] [,3]
## [1,] 0.1 0.0 0.3
## [2,] 0.1 0.1 0.0
## [3,] 0.0 0.0 0.4
(err = rel_freq - th_freq)
## y
## x 1 2 3
## 1 -0.0002 0.0000 0.0017
## 2 -0.0007 0.0005 0.0000
## 3 0.0000 0.0000 -0.0013
Estimate the following integrals using Monte Carlo:
\(\int_0^1 \cos(x)\, dx\)
\(\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}\frac{e^{-x^2/2}}{1+x^4}\,dx\)
\(\int_0^{\infty} e^{-x^3-x}\, dx\)
\(\int_{-\infty}^{\infty} \frac{\cos(x^2)}{1+x^2}\, dx\) (extra credit)
The idea here is to use the “fundamental theorem of statistics” \[ {\mathbb{E}}[ g(X) ] = \int g(x)\, f_X(x)\, dx \] where \(f_X\) is the pdf of \(X\) and \(g\) is any reasonably well-behaved function. Normally, one would use the integral on the right to compute the expectation on the left. We are flipping the logic, and using the expectation (which we can approximate via Monte Carlo) to estimate the integral on the right.
We pick \(g(x) = \cos(x)\) and \(X\) a r.v. with a uniform distribution on \((0,1)\), so that \(f_X(x) = 1\) for \(x\in (0,1)\) and \(0\) otherwise:
nsim = 10000
x = runif(nsim)
y = cos(x)
mean(y)
## [1] 0.8426696
For comparison, the exact value of the integral is \(\sin(1) \approx 0.841471\).
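Since the exact value is known here, this is a good place to see how large the Monte Carlo error typically is. A common way to quantify it, sketched below as an addition to the solution, is the standard error sd(y)/sqrt(nsim), which shrinks like \(1/\sqrt{n}\):

```r
set.seed(9)
nsim = 10000
x = runif(nsim)
y = cos(x)

est = mean(y)              # Monte Carlo estimate of the integral
se = sd(y) / sqrt(nsim)    # estimated standard error of that estimate

# estimate, together with a rough interval of plus/minus two standard errors
c(est, est - 2 * se, est + 2 * se)
```

The exact value \(\sin(1)\) should typically fall inside the interval in the last line.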
We cannot use the uniform distribution anymore, because the limits of integration are \(\pm \infty\). Part of the expression inside the integral can be recognized as a (standard) normal density, so we take \(X \sim N(0,1)\) and \(g(x) = 1/(1+x^4)\)
nsim = 10000
x = rnorm(nsim)
y = 1/(1 + x^4)
mean(y)
## [1] 0.6786274
The “exact” value (i.e., very precise approximation to this integral obtained using another numerical method) is \(0.676763\).
We integrate \(g(x) = \exp(-x^3)\) against the exponential pdf \(f_X(x) = \exp(-x)\), for \(x>0\):
nsim = 10000
x = rexp(nsim)
y = exp(-x^3)
mean(y)
## [1] 0.5680925
A close approximation of the true value is \(0.56889\).
In this case, a possible choice of the distribution for \(X\) is the Cauchy distribution (no worries if you never heard about it), whose pdf is \(f_X(x) = \frac{1}{\pi(1+x^2)}\), so that \(g(x) = \pi \cos(x^2)\):
nsim = 10000
x = rcauchy(nsim)
y = pi * cos(x^2)
mean(y)
## [1] 1.320216
The “exact” value is \(1.30561\).
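The “exact” values quoted above were obtained by a numerical method that is not shown. One possibility, and I am only assuming it is comparable to what was used, is R's built-in function integrate, which performs deterministic numerical integration. The sketch below covers the first three integrals; the oscillating integrand in the extra-credit part is harder for integrate, so it is left out:

```r
# integral of cos(x) over (0,1)
v1 = integrate(cos, lower = 0, upper = 1)$value

# integral of the standard normal density times 1/(1+x^4) over the real line
v2 = integrate(function(x) dnorm(x) / (1 + x^4), lower = -Inf, upper = Inf)$value

# integral of exp(-x^3 - x) over (0, infinity)
v3 = integrate(function(x) exp(-x^3 - x), lower = 0, upper = Inf)$value

c(v1, v2, v3)
```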
The tricylinder is a solid body constructed as follows: create three cylinders of radius 1 around each of the three coordinate axes and intersect them:
Use Monte Carlo to estimate the volume of the tricylinder and check your estimate against the exact value \(8(2-\sqrt{2})\).
By the very construction, it is clear that the entire tricylinder lies within the cube \([-1,1]\times [-1,1] \times[-1,1]\). Therefore, we can compute its volume by simulating random draws from the uniform distribution in that cube, and computing the relative frequency of those values that fall inside the tricylinder. The whole point is that it is easy to check, given a point \((x,y,z)\), whether it lies inside the tricylinder or not. Indeed, the answer is “yes” if and only if all three of the following inequalities are satisfied: \[ x^2+y^2 \le 1,\ x^2+z^2\leq 1 \text{ and } y^2+z^2\leq 1.\]
nsim = 10000
x = runif(nsim, min = -1, max = 1)
y = runif(nsim, min = -1, max = 1)
z = runif(nsim, min = -1, max = 1)
is_in = (x^2 + y^2 <= 1) & (x^2 + z^2 <= 1) & (y^2 + z^2 <= 1)
(2^3 * sum(is_in)/nsim)
## [1] 4.708
We multiplied by \(2^3\) because that is the volume of the cube \([-1,1]\times [-1,1] \times [-1,1]\). Without it, we would get the portion of the cube taken by the tricylinder, and not its volume.
The true value of \(8(2-\sqrt{2})\) is, approximately, \(4.6862\).
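How far off should we expect such an estimate to be? Each draw either lands inside the tricylinder or not, so the usual formula for the standard error of a relative frequency applies. The sketch below is an addition to the solution, with more simulations and a seed chosen by me; it reports the estimate together with its estimated standard error:

```r
set.seed(10)
nsim = 100000
x = runif(nsim, min = -1, max = 1)
y = runif(nsim, min = -1, max = 1)
z = runif(nsim, min = -1, max = 1)
is_in = (x^2 + y^2 <= 1) & (x^2 + z^2 <= 1) & (y^2 + z^2 <= 1)

p_hat = mean(is_in)                                # portion of the cube taken by the tricylinder
volume = 2^3 * p_hat                               # estimated volume
se = 2^3 * sqrt(p_hat * (1 - p_hat) / nsim)        # its estimated standard error

c(volume, se, 8 * (2 - sqrt(2)))                   # estimate, standard error, exact value
```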
Read about the Monty Hall Problem online (the introduction to its Wikipedia page has a nice description). Use Monte Carlo to compare the two possible strategies (switching and not-switching) and decide which is better.
The host knows where the car is and what the contestant’s guess is. If
those two are the same (i.e., contestant guessed right), he will choose
one of the two remaining doors at random. If not, he simply shows the
contestant the other door with the goat behind it. This is exactly what the
function show_door implements:
show_door = function(car, guess) {
all_doors = c(1, 2, 3)
goat_doors = all_doors[all_doors != car]
if (car == guess) {
random_goat_door = sample(goat_doors, size = 1)
return(random_goat_door)
} else {
the_other_goat_door = goat_doors[goat_doors != guess]
return(the_other_goat_door)
}
}
Next, we write a function which simulates the outcome of a single
game. It will have one argument, switch which will
determine whether the contestant switches the door or not.
one_game = function(switch) {
all_doors = c(1, 2, 3)
car = sample(all_doors, size = 1)
guess = sample(all_doors, size = 1)
if (switch) {
unguessed_doors = all_doors[all_doors != guess]
shown_door = show_door(car, guess)
switched_guess = unguessed_doors[unguessed_doors != shown_door]
return(switched_guess == car)
} else {
return(guess == car)
}
}
Finally we run two batches of \(10,000\) simulations, one with
switch=TRUE and another with switch=FALSE:
nsim = 10000
switch_doors = replicate(nsim, one_game(TRUE))
dont_switch_doors = replicate(nsim, one_game(FALSE))
(prob_with_switching = mean(switch_doors))
## [1] 0.668
(prob_without_switching = mean(dont_switch_doors))
## [1] 0.3288
Therefore, the probability of winning after switching is about double the probability of winning without switching. Switching is good for you!
(A philosophical note: this was the most “agnostic” approach to this
simulation. Simulations can often be simplified with a bit of insight.
For example, we could have realized that the switching strategy simply
flips the correctness of the guess (from “correct” to “wrong” and vice
versa) and used it to write a much shorter answer. Ultimately, we could
have realized that, because the probability of the initial guess being
correct is \(1/3\), switching leads to
a correct guess in \(2/3\) of the cases
(and not switching in only \(1/3\) of
the cases). In this case, the whole code would be
sample(c("correct", "incorrect"), size=10000, prob= c(2/3,1/3), replace=TRUE),
which is an extremely inefficient way to estimate the value of the
number \(2/3\)!)
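The intermediate simplification mentioned in the note, namely that switching simply flips the correctness of the initial guess, could be sketched as follows. This is my reading of the remark and not code from the original solution:

```r
set.seed(11)
nsim = 10000
car = sample(1:3, size = nsim, replace = TRUE)     # where the car is
guess = sample(1:3, size = nsim, replace = TRUE)   # contestant's initial guess

win_without_switching = (guess == car)
win_with_switching = !win_without_switching        # switching flips the outcome

c(mean(win_with_switching), mean(win_without_switching))
```

The two relative frequencies add up to exactly \(1\), and should be close to \(2/3\) and \(1/3\).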
Find the beta distribution in R’s help system. The only thing you
need to know about it is that it comes with two positive parameters
\(\alpha\) and \(\beta\) (called shape1 and
shape2 in R). Simulate \(n=10000\) draws from the beta distribution
with parameters \(\alpha=0.5\) and
\(\beta=0.3\). Plot a density histogram
of your simulations as well as the pdf of the underlying distribution on
the same plot. (Note: if you use the command lines instead
of plot for the pdf, it will be automatically added to your
previous plot - the histogram, in this case).
Let \(X\) be a beta-distributed random variable with parameters \(\alpha=0.5\) and \(\beta=0.3\) (as above), and let \(Y\) be an independent random variable with the same distribution. Estimate the probability \({\mathbb{P}}[ XY > 1+ \log(Y)]\). Check graphically whether \(XY\) and \(1+\log(Y)\) are independent random variables.
Let \(X\), \(Y\) and \(Z\) be three independent random variables uniformly distributed on \((0,1)\), and let \(M = \min(X,Y,Z)\). A theorem in probability states that \(M\) has a beta distribution with some parameters \(\alpha\) and \(\beta\). Use simulation (and associated plots) to make an educated guess about what the values of \(\alpha\) and \(\beta\) are. (Hint: they are nice round numbers.)
x = rbeta(10000, shape1 = 0.5, shape2 = 0.3)
hist(x, probability = T)
x_pdf = seq(0, 1, by = 0.01)
y_pdf = dbeta(x_pdf, 0.5, 0.3)
lines(x_pdf, y_pdf, type = "l", col = "red", lwd = 2)
We use Monte Carlo to estimate the probability:
x = rbeta(10000, shape1 = 0.5, shape2 = 0.3)
y = rbeta(10000, shape1 = 0.5, shape2 = 0.3)
z = x * y > 1 + log(y)
mean(z)
## [1] 0.4457
and the permutation test to check for (in)dependence
w = x * y
z = 1 + log(y)
par(mfrow = c(2, 2))
z_perm_1 = sample(z)
z_perm_2 = sample(z)
z_perm_3 = sample(z)
plot(w, z)
plot(w, z_perm_1)
plot(w, z_perm_2)
plot(w, z_perm_3)
Since the first plot looks very different from the other three, we conclude that \(XY\) and \(1+\log(Y)\) are most likely not independent.
First, we simulate \(10,000\)
draws from the distribution of \(M\).
The function min is not vectorized (and it should not be!),
so we cannot simply write m = min(x, y, z). Luckily, there is another function, called
pmin (where p stands for parallel) which
returns the component-wise min. Alternatively, we could use the function
apply to apply min to each row of the data
frame df which contains the simulations of \(X,Y\) and \(Z\):
nsim = 10000
x = runif(nsim)
y = runif(nsim)
z = runif(nsim)
m = pmin(x, y, z)
# or, alternatively,
df = data.frame(x, y, z)
m = apply(df, 1, min)
The idea is to compare the histogram of (the simulations) of \(M\) to pdfs of the beta distributions for
various values of parameters and see what fits best. The function
try_parameters below does exactly that - it superimposes the
beta pdf onto the histogram of m (just like in part 1.
above). We try a few different values, and finally settle on \(\alpha=1\) and \(\beta=3\), because it seems to fit well. It
turns out that \(\alpha=1\) and \(\beta=3\) is, indeed, correct.
try_parameters = function(alpha, beta) {
hist(m, probability = T, main = paste("Trying alpha = ", alpha, " beta = ", beta))
x_pdf = seq(0, 1, by = 0.01)
y_pdf = dbeta(x_pdf, alpha, beta)
lines(x_pdf, y_pdf, type = "l", col = "red", lwd = 2)
}
par(mfrow = c(3, 2))
try_parameters(3, 2)
try_parameters(2, 2)
try_parameters(1, 0.5)
try_parameters(0.5, 2)
try_parameters(0.5, 3)
try_parameters(1, 3)
Alternatively, you could have looked up the mean and the variance of the beta distribution online, and obtained the following expressions: \[ \text{ mean}= \frac{\alpha}{\alpha+\beta}, \text{ variance} = \frac{\alpha\beta}{(\alpha+\beta)^2 (\alpha+\beta+1)},\] and then tried to find \(\alpha\) and \(\beta\) to match the estimated mean and variance
c(mean(m), var(m))
## [1] 0.24952292 0.03719576
The first equation tells us that \(\alpha/(\alpha+\beta)\) is about \(1/4\), i.e., \(3 \alpha \approx \beta\). We plug the obtained value of \(\beta\) into the equation for variance to get the following equation: \[ 0.0372 \approx \frac{ 3 \alpha^2}{ (4\alpha)^2 (4 \alpha+1)},\] so that \[ 4 \alpha + 1 \approx \frac{3}{ 16\times 0.0372} \approx 5.04, \text{ i.e., } \alpha \approx 1 \text{ and } \beta \approx 3.\] This estimation technique - where the mean and variance (and possibly higher moments) are computed and matched to parameters - is called the method of moments.
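The two moment equations can also be solved in closed form. Writing \(m\) for the mean and \(v\) for the variance, they give \(\alpha+\beta = m(1-m)/v - 1\), and then \(\alpha = m(\alpha+\beta)\) and \(\beta = (1-m)(\alpha+\beta)\). The sketch below wraps this into a small function (the name mom_beta is hypothetical) and applies it to the estimates obtained above:

```r
# method-of-moments estimates of the beta parameters from a mean m and a variance v
mom_beta = function(m, v) {
  total = m * (1 - m) / v - 1          # this is alpha + beta
  c(alpha = m * total, beta = (1 - m) * total)
}

# the estimated mean and variance of m reported above
est = mom_beta(0.24952292, 0.03719576)
est
```

The resulting values are close to \(1\) and \(3\), in agreement with the hand computation.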
In case you are curious, here is how to derive the theorem from the problem (and check that our guess is, indeed, correct). We note that for any \(x\in (0,1)\), \(M> x\) if and only if \(X>x\), \(Y>x\) and \(Z>x\). Therefore, by independence of \(X,Y\) and \(Z\), we have \[ {\mathbb{P}}[ M > x ] = {\mathbb{P}}[ X>x, Y>x, Z>x] = {\mathbb{P}}[X>x] \times {\mathbb{P}}[Y>x] \times {\mathbb{P}}[Z>x] = (1-x)^3.\] From there, we conclude that the cdf of \(M\) is given by \[F(x) = {\mathbb{P}}[M\leq x] = 1- {\mathbb{P}}[M>x] = 1 - (1-x)^3.\] Since \(F\) is a smooth function, we can differentiate it to get the pdf: \[ f(x) = F'(x) = 3 (1-x)^2,\] which is exactly the pdf of the beta distribution with parameters \(\alpha=1\) and \(\beta=3\) (see the Wikipedia page of the beta distribution for this and many other facts).
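If you do not feel like looking up the formula for the beta pdf, R can confirm the last claim. The sketch below compares dbeta with shape1 = 1 and shape2 = 3 to the function \(3(1-x)^2\) on a grid of points (the grid is an arbitrary choice of mine):

```r
x_grid = seq(0.05, 0.95, by = 0.05)

beta_pdf = dbeta(x_grid, shape1 = 1, shape2 = 3)   # beta pdf with alpha = 1, beta = 3
derived_pdf = 3 * (1 - x_grid)^2                   # the pdf derived above

same = isTRUE(all.equal(beta_pdf, derived_pdf))
same
```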
Find the gamma distribution in R’s help system. The only thing
you need to know about it is that it comes with two positive parameters
\(k\) and \(\theta\) (called shape and
scale in R). Simulate \(n=10000\) draws from the gamma distribution
with parameters \(k=2\) and \(\theta=1\). Plot a density histogram of
your simulations as well as the pdf of the underlying distribution on
the same plot. Repeat for \(3\) more
choices (left to you) of the parameters \(k\) and \(\theta\).
Let \(X\) be a gamma-distributed random variable with parameters \(k=2\) and \(\theta=1\) (as above), and let \(Y_1, Y_2\) be independent standard normals (also independent of \(X\)). Estimate the probability \({\mathbb P}[ X > Y_1^2 + Y_2^2 ]\) using Monte Carlo. Check graphically whether \(XY_1\) and \(X Y_2\) are independent.
Let \(X\), \(Y\) and \(Z\) be three independent random variables with the gamma distribution and the scale parameter \(\theta=1\). Their shape parameters are, however, different, and equal to \(1,2\) and \(3\), respectively. Use simulation (and associated plots) to make an educated guess about the distribution of \(X+Y+Z\) and its parameters. (Hint: the parameters will be nice round numbers.)
hist_with_pdf <- function(k, theta) {
x = rgamma(10000, shape = k, scale = theta)
hist(x,
probability = T,
breaks = 50,
xlim = c(0,15), ylim = c(0, 0.8),
main = paste("shape = ", k, ", scale = ", theta)
)
x_pdf = seq(0, 15, by = 0.01)
y_pdf = dgamma(x_pdf, shape = k, scale = theta)
lines(x_pdf, y_pdf,
type = "l",
col = "red",
lwd = 2)
}
par(mfrow = c(2, 2))
hist_with_pdf(2,1)
hist_with_pdf(2,0.5)
hist_with_pdf(3,1)
hist_with_pdf(3,0.5)
We use Monte Carlo to estimate the probability:
x = rgamma(10000, shape = 2, scale = 1)
y1 = rnorm(10000)
y2 = rnorm(10000)
z = x > y1^2 + y2^2
mean(z)
## [1] 0.5508
and the permutation test to check for (in)dependence:
w = x * y1
z = x * y2
par(mfrow = c(2, 2))
z_perm_1 = sample(z)
z_perm_2 = sample(z)
z_perm_3 = sample(z)
plot(w, z)
plot(w, z_perm_1)
plot(w, z_perm_2)
plot(w, z_perm_3)
The common shape of plots 2, 3 and 4 is sufficiently different from the shape of plot 1 to conclude that \(XY_1\) and \(XY_2\) are most likely not independent. Note: the correlation between \(XY_1\) and \(XY_2\) is \(0\) (and an estimate computed from the simulations would be close to \(0\)) - this is an example of uncorrelated random variables that are, nevertheless, not independent.
First, we simulate \(10,000\) draws from the distribution of \(S = X + Y + Z\) and plot the histogram of obtained values:
nsim = 10000
x = rgamma(nsim, shape = 1, scale = 1)
y = rgamma(nsim, shape = 2, scale = 1)
z = rgamma(nsim, shape = 3, scale = 1)
s = x + y + z
hist(s, breaks = 100)
The shape resembles the shape we obtained in 1. above, so we try the
gamma distribution with various parameters. The function
try_parameters below does exactly that - it superimposes the
gamma pdf onto the histogram of s.
try_parameters = function(k, theta) {
  hist(s, probability = TRUE, breaks = 50,
       main = paste("Trying shape = ", k, " scale = ", theta))
  x_pdf = seq(0, 15, by = 0.01)
  # name the arguments: the third positional argument of dgamma is the rate, not the scale
  y_pdf = dgamma(x_pdf, shape = k, scale = theta)
  lines(x_pdf, y_pdf, type = "l", col = "red", lwd = 2)
}
par(mfrow = c(3, 3))
try_parameters(1, 1)
try_parameters(2, 1)
try_parameters(1, 2)
try_parameters(1, 3)
try_parameters(2, 1)
try_parameters(4, 1)
try_parameters(6, 1)
We conclude that \(X+Y+Z\) is most
likely gamma-distributed with \(k=6\)
and \(\theta=1\). This is, indeed, the
case. The sum of independent gammas with shape parameters \(k_1, \dots, k_n\) and the same scale
parameter \(\theta\) is a gamma with
parameters \(k = k_1+\dots+k_n\) and
\(\theta\).
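One way to double-check this guess numerically is to compare a few empirical quantiles of the simulated sums with the corresponding quantiles of the gamma distribution with \(k=6\) and \(\theta=1\). This is only a sketch: the seed, the number of simulations and the probability levels are arbitrary choices.

```r
set.seed(2)                            # arbitrary seed
nsim = 100000
s = rgamma(nsim, shape = 1, scale = 1) +
    rgamma(nsim, shape = 2, scale = 1) +
    rgamma(nsim, shape = 3, scale = 1)

probs = c(0.1, 0.25, 0.5, 0.75, 0.9)   # probability levels to compare at
q_emp = quantile(s, probs)             # empirical quantiles of X + Y + Z
q_th = qgamma(probs, shape = 6, scale = 1)   # quantiles of the conjectured gamma
round(rbind(q_emp, q_th), 2)
```

If the two rows agree up to Monte Carlo error, the simulations are consistent with the guess; agreement at a handful of quantiles does not, of course, constitute a proof.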
We learned how to simulate from the joint distribution of a discrete random vector \((X,Y)\) by thinking of it as a one-dimensional distribution whose values are represented by pairs of numbers. Here is another way this can be done:
Find the marginal distribution of one of them, say \(X\), and simulate from it
Given the value you just obtained, let’s call it \(x\), simulate from the conditional distribution of \(Y\), given \(X=x\).
| | 1 | 2 | 3 |
|---|---|---|---|
| 1 | 0.1 | 0.0 | 0.3 |
| 2 | 0.1 | 0.1 | 0.0 |
| 3 | 0.0 | 0.0 | 0.4 |
Display the contingency table of your simulations, first using counts, and then using relative frequencies. Compare to the theoretical values (i.e., the probabilities in the table above).
margin_X = c(0.2, 0.1, 0.7 )
cond_Y_X = matrix(
c( 0.5, 0.0, 3/7,
0.5, 1.0, 0.0,
0.0, 0.0, 4/7),
byrow=TRUE,
nrow=3)
single_draw = function() {
x = sample(c(1,2,3), size=1, prob=margin_X)
y = sample(c(1,2,3), size=1, prob=cond_Y_X[,x])
return(c(x,y))
}
nsim=10000
df = data.frame(
t(replicate(nsim, single_draw()))
)
colnames(df) = c("x","y")
t(table(df))
## x
## y 1 2 3
## 1 962 0 2963
## 2 1012 1023 0
## 3 0 0 4040
The variables margin_X and cond_Y_X are
what you get when you compute the marginal and the conditional
distribution from the given joint-distribution table as you did in your
probability class.
The function single_draw performs a single draw from the
distribution of \((X,Y)\) by first
drawing the value of \(X\) from its
marginal distribution. Then, it chooses the column of the conditional
distribution table according to the obtained value of \(X\) and simulates from it.
The function replicate is used to repeat
single_draw many times and collect the results. By default,
replicate attaches the output of each new “replication” as
a new column and not a row, so we need to transpose the final product.
That is what the function t() is for. We turn the result
into a data frame because the function table knows how to
handle data frames automatically. Another use of the transpose gives
x the horizontal axis, and y the vertical one,
like in the statement of the problem.
Let \((Z,W)\) be a discrete random vector with the following joint distribution table (different rows correspond to different values of \(Z\)):
| | -1 | 0 | 1 |
|---|---|---|---|
| -1 | 0.2 | 0.1 | 0.0 |
| 0 | 0.1 | 0.2 | 0.1 |
| 1 | 0.0 | 0.1 | 0.2 |
and set \[ X = Z+W \text{ and } Y=Z W.\]
Find the distribution table for \((X,Y)\) and compute \({\mathbb{E}}[ X Y]\) and \({\mathbb{P}}[X=0 | Y=0]\) analytically (no simulations!). You can do this part on a separate piece of paper or inside your Rmd file if you know LaTeX.
Draw \(10000\) simulations of \((X,Y)\) (don’t print them out). Display the contingency table of your simulations using relative frequencies, as well as the table of errors (differences between your contingency table and the table of probabilities).
Compute \({\mathbb{E}}[X Y]\) again, but this time using your simulations from 2. above.
Draw simulations from the conditional distribution of \(X\), given \(Y=0\). Use them to estimate the conditional probability \({\mathbb{P}}[ X=0 | Y=0]\). By how much does it differ from your analytical result from 1. above?
| | 0 | 1 |
|---|---|---|
| -2 | 0.0 | 0.2 |
| -1 | 0.2 | 0.0 |
| 0 | 0.2 | 0.0 |
| 1 | 0.2 | 0.0 |
| 2 | 0.0 | 0.2 |
To compute \({\mathbb{E}}[XY]\) we compute the value of the product \(i j\) for each entry in the table above, multiply it by the probability there and sum the obtained results. The only non-zero terms are \(1 \times (-2) \times 0.2\) and \(1 \times 2 \times 0.2\), so \[ {\mathbb{E}}[ XY ] = -2 \times 0.2 + 2 \times 0.2 = 0.\] Finally, \[ {\mathbb{P}}[ X=0 | Y=0] = \frac{{\mathbb{P}}[ X=0, Y=0]}{ {\mathbb{P}}[ Y=0]} = \frac{ 0.2}{ 0.2 + 0.2 + 0.2} = \frac{1}{3}.\]
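The same bookkeeping can be delegated to R, without any simulation. The sketch below is one possible way to do it (the names zw, table_xy, E_xy and p_cond are introduced here for illustration only): it lists all pairs \((z,w)\) with their probabilities, transforms them into \((x,y)\), and adds up the probabilities.

```r
# joint distribution of (Z, W) in "long" form, read off the table in the problem
zw = expand.grid(w = c(-1, 0, 1), z = c(-1, 0, 1))   # w varies fastest
zw$prob = c(0.2, 0.1, 0.0,   # z = -1
            0.1, 0.2, 0.1,   # z =  0
            0.0, 0.1, 0.2)   # z =  1
zw$x = zw$z + zw$w
zw$y = zw$z * zw$w

# distribution table of (X, Y): add up the probabilities of all (z, w)
# that lead to the same (x, y); pairs with probability 0 are dropped first
(table_xy = xtabs(prob ~ x + y, data = zw[zw$prob > 0, ]))

# E[XY] and P[X = 0 | Y = 0], computed exactly from the table
E_xy = sum(zw$x * zw$y * zw$prob)
p_cond = sum(zw$prob[zw$x == 0 & zw$y == 0]) / sum(zw$prob[zw$y == 0])
c(E_xy, p_cond)
```

The two numbers printed at the end should match the values \(0\) and \(1/3\) obtained by hand.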
We borrow the code from the notes to draw simulations from the distribution of \((Z,W)\), first:
joint_distribution_long = data.frame(
z = c(-1,-1,-1, 0,0,0, 1,1,1),
w = c(-1,0,1, -1,0,1, -1,0,1) )
probabilities_long = c(0.2,0.1,0.0,0.1,0.2,0.1,0.0,0.1,0.2);
sampled_rows = sample(
x = 1:nrow(joint_distribution_long),
size = 10000,
replace = TRUE,
prob = probabilities_long )
d_zw = joint_distribution_long[sampled_rows, ]
Then we transform the results into \((X,Y)\) and output the table of relative frequencies:
draws_xy = data.frame(x = d_zw$z + d_zw$w, y = d_zw$z * d_zw$w)
(cont_table = prop.table(table(draws_xy)))
## y
## x 0 1
## -2 0.0000 0.2059
## -1 0.1993 0.0000
## 0 0.1987 0.0000
## 1 0.1940 0.0000
## 2 0.0000 0.2021
We build a matrix containing the probabilities we derived in 1. above
and subtract it from cont_table:
prob_table = matrix(c(0, 0.2, 0.2, 0.2, 0, 0.2, 0, 0, 0, 0.2), nrow = 5)
(error_table = cont_table - prob_table)
## y
## x 0 1
## -2 0.0000 0.0059
## -1 -0.0007 0.0000
## 0 -0.0013 0.0000
## 1 -0.0060 0.0000
## 2 0.0000 0.0021
Monte Carlo:
mean(draws_xy$x * draws_xy$y)
## [1] -0.0076
First we remove all rows from draws_xy whose \(y\)-component is not \(0\):
draws_cond = draws_xy[draws_xy$y == 0, ]
Then we compute the relative frequency of the remaining draws where \(x=0\):
(est_prob = mean(draws_cond$x == 0))
## [1] 0.3356419
The true value is \(1/3\), so the error is given by
est_prob - 1/3
## [1] 0.002308559
Harry Potter’s cousin Nigel Potter owns a magical set of three dice. They behave just like any three (fair, independent) 6-sided dice, except for the fact that, when thrown together, they always show three distinct numbers. In other words, the outcomes with repeating numbers never happen, while all combinations of three distinct numbers are equally likely.
Create a large (\(\ge 10,000\)) set of simulations of throws of these magical dice. Use any method you like.
Output the contingency table for the outcomes of the first two dice, and compare to theoretical probabilities.
Draw samples from the conditional distribution of the outcome of the third die, given that the sum on the first two is \(6\). Display the contingency table for your outcomes.
(Extra credit) Find the (theoretical) conditional distribution of the third die given that the sum on the first two is \(6\), and compare it to your result from 3. above.
Here are three different methods you can use to do this problem:
Sampling without replacement:
The three magical dice sample without replacement from the set \(\{1,2,3,4,5,6\}\), so we can simply do the following:
nsim = 10000
sims = data.frame(t(replicate(nsim, sample(c(1, 2, 3, 4, 5, 6), size = 3, replace = FALSE))))
colnames(sims) = c("x", "y", "z")
The rejection method:
We simulate three regular dice, and then simply reject all outcomes where two of the numbers are the same:
nsim = 25000
x0 = sample(c(1, 2, 3, 4, 5, 6), size = nsim, replace = TRUE)
y0 = sample(c(1, 2, 3, 4, 5, 6), size = nsim, replace = TRUE)
z0 = sample(c(1, 2, 3, 4, 5, 6), size = nsim, replace = TRUE)
good = (x0 != y0) & (y0 != z0) & (x0 != z0)
sims = data.frame(x = x0[good], y = y0[good], z = z0[good])
nsim = dim(sims)[1] # we rejected a random number of draws
Using sample:
We need to make the list of all allowed combinations (no repeats) and
then sample the entire triplet of dice at once. I am using
rbind to append a row to a data frame, but you can do this
in many other ways.
nsim = 100000
df = data.frame()
for (i in 1:6) {
for (j in 1:6) {
for (k in 1:6) {
if ((i != j) & (j != k) & (i != k)) {
df = rbind(df, c(i, j, k))
}
}
}
}
colnames(df) = c("x", "y", "z")
rows = as.numeric(row.names(df)) # as.numeric needed to turn strings into integers
sampled_rows = sample(rows, nsim, replace = TRUE)
sims = df[sampled_rows, ]
The probability of seeing a pair \((i,j)\) is \(0\) if \(i=j\) and \(1/30\) otherwise (where \(30\) is the number of pairs of different numbers).
table_sim = table(sims$x, sims$y)/nsim
table_th = matrix(1/30, ncol = 6, nrow = 6)
for (i in 1:6) table_th[i, i] = 0
(error = table_sim - table_th)
##
## 1 2 3 4 5 6
## 1 0.000000 0.000267 0.000177 0.000047 -0.000653 0.001067
## 2 -0.000073 0.000000 -0.000623 0.000927 -0.000743 0.000207
## 3 -0.000843 0.000217 0.000000 -0.000233 -0.000543 0.000777
## 4 0.000017 0.000637 -0.000473 0.000000 0.000557 -0.000453
## 5 0.000017 -0.000373 -0.000343 0.000277 0.000000 0.000077
## 6 -0.000293 0.001307 -0.000783 -0.000023 -0.000113 0.000000
good = (sims$x + sims$y == 6)
z_cond = sims$z[good]
(table_sim = table(z_cond)/length(z_cond))
## z_cond
## 1 2 3 4 5 6
## 0.13 0.12 0.25 0.12 0.12 0.25
We can reuse the code from above (simulation using
sample) to list all possible elementary outcomes with \(x+y=6\) and then count the frequencies of
different values of the third die in this list (or we could simply do
that on a piece of paper):
df = data.frame()
for (i in 1:6) {
for (j in 1:6) {
for (k in 1:6) {
if ((i != j) & (j != k) & (i != k) & (i + j == 6)) {
df = rbind(df, c(i, j, k))
}
}
}
}
colnames(df) = c("x", "y", "z")
(table_th = table(df$z)/length(df$z))
##
## 1 2 3 4 5 6
## 0.12 0.12 0.25 0.12 0.12 0.25
The error is given by:
options(digits = 2)
(error = table_sim - table_th)
## z_cond
## 1 2 3 4 5 6
## 0.00520 -0.00017 -0.00279 -0.00255 -0.00173 0.00205
Exactly one percent of the people in a given population have a certain disease. The accuracy of the diagnostic test for it is such that it detects the sick as sick with probability \(0.95\) and the healthy as healthy with probability \(0.9\). A person chosen at random from the population tested positive. What is the probability that he/she is, in fact, sick? Do the problem both analytically and by Monte Carlo.
The person can test positive (denoted by \(tS\) in the plot) in two ways: by actually being sick (\(S\)) and then testing positive, or by being healthy and then testing positive. Bayes formula (or simply a look at the picture above) gives us \[ {\mathbb{P}}[ S | tS ] = \frac{ 0.01 \times 0.95}{ 0.01\times 0.95 + 0.1\times 0.99 } \approx 0.088.\] Thus, even though the test is quite accurate, a positive result is most likely a false positive: the probability that a person who tested positive is actually healthy is about \(0.91\).
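The arithmetic in the Bayes formula is easy to reproduce in R. In the short sketch below the variable names are ours; the three input numbers are the ones given in the problem.

```r
p_S = 0.01          # P[S]: prevalence of the disease
p_pos_S = 0.95      # P[tS | S]: the sick test positive
p_pos_H = 1 - 0.9   # P[tS | H]: the healthy test positive (false positive)

# law of total probability: P[tS]
p_pos = p_S * p_pos_S + (1 - p_S) * p_pos_H

# Bayes formula: P[S | tS]
(p_S_pos = p_S * p_pos_S / p_pos)
```

Changing p_S shows how strongly the answer depends on the prevalence of the disease in the population.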
Let us do the same via Monte Carlo. We proceed like this. First we
“pick a person” from the population by sampling from
c("H", "S") and then “test” this person. After repeating
this nsim times, we condition on the positive test, by
removing all draws where the test was negative. This leaves us with a
population of people who tested positive, and we simply need to see what
proportion of those are actually sick.
single_draw = function() {
x = sample(c("H", "S"), size = 1, prob = c(0.99, 0.01))
if (x == "H") {
y = sample(c("tH", "tS"), size = 1, prob = c(0.9, 0.1))
} else {
y = sample(c("tH", "tS"), size = 1, prob = c(0.05, 0.95))
}
return(c(x, y))
}
nsim = 100000
df = data.frame(t(replicate(nsim, single_draw())))
colnames(df) = c("status", "test_result")
cond = (df$test_result == "tS")
df_cond = df[cond, ]
(prob = mean(df_cond$status == "S"))
## [1] 0.088
A point is chosen at random, uniformly in the unit cube \([0,1]\times [0,1]\times [0,1]\). Its distance to the origin \((0,0,0)\) is measured, and turns out to be equal to \(1.5\).
Use simulations to estimate the shape of the pdf of the conditional distribution of the point’s distance to \((1,1,1)\). Compare it to the unconditional case, i.e., the case where no information about the distance to \((0,0,0)\) is known.
Compute the mean of this (conditional) distribution for a few values of the parameter \(\varepsilon\) you use to deal with conditioning in the continuous case. Make sure you include values of \(\varepsilon\) on both sides of the spectrum - too big, and too small.
We want to vary the parameter eps later, so let’s write
a function first:
simulate = function(nsim, eps, conditional) {
x = runif(nsim)
y = runif(nsim)
z = runif(nsim)
d1 = sqrt((1 - x)^2 + (1 - y)^2 + (1 - z)^2)
if (conditional) {
d0 = sqrt(x^2 + y^2 + z^2)
cond = (d0 > 1.5 - eps) & (d0 < 1.5 + eps)
return(d1[cond])
} else {
return(d1)
}
}
Histograms may be used as approximations to the pdf of the (conditional) distribution:
nsim = 1000000
eps = 0.1
d1_cond = simulate(nsim, eps, conditional = TRUE)
d1 = simulate(nsim, eps, conditional = FALSE)
par(mfrow = c(1, 2))
hist(d1, breaks = 50)
hist(d1_cond, breaks = 50)
Note that, in addition to clearly different shapes, the supports of the two distributions differ, too. Unconditionally, the distance to \((1,1,1)\) can be any number from \(0\) to \(\sqrt{3} \approx 1.73\). If it is known that the distance to \((0,0,0)\) is \(1.5\), however, the distance to \((1,1,1)\) cannot be larger than \(1\).
Finally, let us compare the results we obtain by varying the
parameter \(\varepsilon\), first with
nsim=100000:
nsim = 100000
epss = c(2, 1, 0.5, 0.3, 0.2, 0.1, 0.02, 0.01, 0.001, 0.0001)
d1s = vector(length = length(epss))
for (eps in epss) {
sims = simulate(nsim, eps = eps, conditional = TRUE)
print(paste("Eps = ", eps, ", Draws = ", length(sims), " Mean = ", mean(sims)))
}
## [1] "Eps = 2 , Draws = 100000 Mean = 0.960342228444094"
## [1] "Eps = 1 , Draws = 93544 Mean = 0.929726634933677"
## [1] "Eps = 0.5 , Draws = 47806 Mean = 0.753283074403369"
## [1] "Eps = 0.3 , Draws = 20224 Mean = 0.595529938750379"
## [1] "Eps = 0.2 , Draws = 10658 Mean = 0.496362640888308"
## [1] "Eps = 0.1 , Draws = 3930 Mean = 0.36209037346251"
## [1] "Eps = 0.02 , Draws = 647 Mean = 0.310139851430518"
## [1] "Eps = 0.01 , Draws = 354 Mean = 0.30926108659397"
## [1] "Eps = 0.001 , Draws = 33 Mean = 0.334224126737702"
## [1] "Eps = 0.0001 , Draws = 4 Mean = 0.275311021507976"
The same experiment, but with nsim=1000000 yields:
nsim = 1000000
epss = c(2, 1, 0.5, 0.3, 0.2, 0.1, 0.02, 0.01, 0.001, 0.0001)
d1s = vector(length = length(epss))
for (eps in epss) {
sims = simulate(nsim, eps = eps, conditional = TRUE)
print(paste("Eps = ", eps, ", Draws = ", length(sims), " Mean = ", mean(sims)))
}
## [1] "Eps = 2 , Draws = 1000000 Mean = 0.960506260771782"
## [1] "Eps = 1 , Draws = 934686 Mean = 0.928432784386909"
## [1] "Eps = 0.5 , Draws = 477123 Mean = 0.753640516540863"
## [1] "Eps = 0.3 , Draws = 201634 Mean = 0.596081456550207"
## [1] "Eps = 0.2 , Draws = 103279 Mean = 0.494799342061533"
## [1] "Eps = 0.1 , Draws = 38529 Mean = 0.361335849359528"
## [1] "Eps = 0.02 , Draws = 6944 Mean = 0.3064588479179"
## [1] "Eps = 0.01 , Draws = 3371 Mean = 0.30560770769072"
## [1] "Eps = 0.001 , Draws = 331 Mean = 0.301734543251831"
## [1] "Eps = 0.0001 , Draws = 33 Mean = 0.31500400469211"
Simulate \(10000\) draws from a uniform distribution inside the cube \([-1,1]\times[-1,1] \times [-1,1]\). Go through your simulations, and discard the ones that do not lie within the unit ball (the ball centered around \((0,0,0)\), with radius \(1\).) Now you have a bunch of uniform simulations from the unit ball (Do not display them).
Use your simulations to estimate the mean and the standard deviation of \(W\), where \(W\) is the distance from the origin to a randomly and uniformly chosen point in the unit ball.
A point was chosen randomly and uniformly in the unit ball, and it has been observed that it is closer to the point \((1,-1,1)\) than to \((1,-1,-1)\). Estimate (graphically) the shape of the conditional pdf of its distance to the origin, given this observation.
(extra credit) Do 3., but with the point chosen uniformly over the surface of the ball and for the distance to the point \((1,1,1)\) and not to the origin (which is always \(1\) in this case). Even more extra points if you do not draw any additional simulations.
We draw the \(x\), \(y\) and \(z\) coordinates independently of each other, and uniformly in \([-1,1]\). Then we discard those simulations where the sum of the squares is \(>1\):
nsim = 10000
x_cube = runif(nsim, min = -1, max = 1)
y_cube = runif(nsim, min = -1, max = 1)
z_cube = runif(nsim, min = -1, max = 1)
in_ball = x_cube^2 + y_cube^2 + z_cube^2 <= 1
x = x_cube[in_ball]
y = y_cube[in_ball]
z = z_cube[in_ball]
By Monte Carlo:
W = sqrt(x^2 + y^2 + z^2)
mean(W)
## [1] 0.75
sd(W)
## [1] 0.19
Note that the mean is not half the radius, i.e. \(0.5\). Why?
We add two new variables, \(d1\) and \(d2\) - distances to \((1,-1,1)\) and \((1,-1,-1)\), respectively:
d1 = sqrt((x - 1)^2 + (y + 1)^2 + (z - 1)^2)
d2 = sqrt((x - 1)^2 + (y + 1)^2 + (z + 1)^2)
condition = d1 < d2
W_cond = W[condition]
hist(W_cond, breaks = 50, probability = T)
The main difficulty in this part of the problem is getting simulations from the uniform distribution on the sphere. One possibility is to use the rejection method, and proceed as in part 1., but keep only those points for which \[1-\varepsilon < \sqrt{x^2+y^2+z^2} < 1+\varepsilon\] for some (small) \(\varepsilon\). There is a more efficient method, though, and it does not involve any new simulations. We simply take our uniformly distributed points inside the unit ball and project each one onto the sphere, i.e. replace it by the closest point on the surface. This is easily achieved since the projection of the point \((x,y,z)\) is \[\Big( \frac{x}{\sqrt{x^2+y^2+z^2}}, \frac{y}{\sqrt{x^2+y^2+z^2}}, \frac{z}{\sqrt{x^2+y^2+z^2}} \Big).\]
A moment’s thought will convince you that those points indeed cover the sphere uniformly (no direction is preferred!).
Since we already have the distance to the origin stored in the
variable W, the code will be extremely simple:
x_s = x/W
y_s = y/W
z_s = z/W
The rest now parallels what happened in 3.
d1 = sqrt((x_s - 1)^2 + (y_s + 1)^2 + (z_s - 1)^2)
d2 = sqrt((x_s - 1)^2 + (y_s + 1)^2 + (z_s + 1)^2)
condition = d1 < d2
# distance to (1,1,1) must be measured from the projected points on the sphere
W1 = sqrt((x_s - 1)^2 + (y_s - 1)^2 + (z_s - 1)^2)
W1_cond = W1[condition]
hist(W1_cond, breaks = 50, probability = TRUE)
A stochastic process is a sequence - finite or infinite - of random variables. We usually write \(\{X_n\}_{n\in{\mathbb{N}}_0}\) or \(\{X_n\}_{0\leq n \leq T}\), depending on whether we are talking about an infinite or a finite sequence. The number \(T\in {\mathbb{N}}_0\) is called the time horizon, and we sometimes set \(T=+\infty\) when the sequence is infinite. The index \(n\) is often interpreted as time, so that a stochastic process can be thought of as a model of a random process evolving in time. The initial value of the index \(n\) is often normalized to \(0\), even though other values may be used. This is usually very clear from the context.
It is important that all the random variables \(X_0, X_1,\dots\) “live” on the same sample space \(\Omega\). This way, we can talk about the notion of a trajectory or a sample path of a stochastic process: it is, simply, the sequence of numbers \[X_0(\omega), X_1(\omega), \dots\] but with \(\omega\in \Omega\) considered “fixed”. In other words, we can think of a stochastic process as a random variable whose value is not a number, but a sequence of numbers. This will become much clearer once we introduce enough examples.
A stochastic process \(\{X_n\}_{n\in{\mathbb{N}}_0}\) is said to be a simple symmetric random walk (SSRW) if
\(X_0=0\),
the random variables \(\delta_1 = X_1-X_0\), \(\delta_2 = X_2 - X_1\), …, called the steps of the random walk, are independent
each \(\delta_n\) has a coin-toss distribution, i.e., its distribution is given by \[{\mathbb{P}}[ \delta_n = 1] = {\mathbb{P}}[ \delta_n=-1] = \tfrac{1}{2} \text{ for each } n.\]
Some comments:
This definition captures the main features of an idealized notion of a particle that gets shoved, randomly, in one of two possible directions, over and over. In other words, these “shoves” force the particle to take a step, and steps are modeled by the random variables \(\delta_1,\delta_2, \dots\). The position of the particle after \(n\) steps is \(X_n\); indeed, \[X_n = \delta_1 + \delta_2 + \dots + \delta_n \text{ for }n\in {\mathbb{N}}.\] It is important to assume that any two steps are independent of each other - the most important properties of random walks depend on this in a critical way.
Sometimes, we only need a finite number of steps of a random walk, so we only care about the random variables \(X_0, X_1,\dots, X_T\). This stochastic process (now with a finite time horizon \(T\)) will also be called a random walk. If we want to stress that the horizon is not infinite, we sometimes call it the finite-horizon random walk. Whether \(T\) is finite or infinite is usually clear from the context.
The starting point \(X_0=0\) is just a normalization. Sometimes we need more flexibility and allow our process to start at \(X_0=x\) for some \(x\in {\mathbb{N}}\). To stress that fact, we talk about the random walk starting at \(x\). If no starting point is mentioned, you should assume \(X_0=0\).
We will talk about biased (or asymmetric) random walks a bit later. The only difference will be that the probabilities of each \(\delta_n\) taking values \(1\) or \(-1\) will be \(p\in (0,1)\) and \(1-p\), and not necessarily \(\tfrac{1}{2}\). The probability \(p\) cannot change from step to step and the steps \(\delta_1, \delta_2, \dots\) will continue to be independent from each other.
The word simple in its name refers to the fact that the distribution of every step is a coin toss. You can easily imagine a more complicated mechanism that would govern each step. For example, not only the direction, but also the size of the step could be random. In fact, any distribution you can think of can be used as a step distribution of a random walk. Unfortunately, we will have very little to say about such general random walks in these notes.
In addition to being quite simple conceptually, random walks are also easy to simulate. The fact that the steps \(\delta_n = X_n - X_{n-1}\) are independent coin tosses immediately suggests a feasible strategy: simulate \(T\) independent coin tosses first, and then define each \(X_n\) as the sum of the first \(n\) tosses.
Before we implement this idea in R, let us agree on a few conventions which we will use whenever we simulate a stochastic process:
the result of each simulation is a data.frame object,
its columns are named X0, X1, X2, etc.
This is best achieved by the following two-stage approach in R:
write a function which will simulate a single trajectory of your process. If your process comes with parameters, it is a good idea to include them as arguments to this function.
use the function replicate to stack together many
such simulations and convert the result to a data.frame.
Don’t forget to transpose after (use the function t)
because replicate works column by column, and not row by
row.
Let’s implement this in the case of a simple random walk. Of course,
it is impossible to simulate a random walk on an infinite horizon (\(T=\infty\)) so we must restrict to
finite-horizon random walks. The function cumsum which
produces partial sums of its input comes in very handy.
single_trajectory = function(T, p = 0.5) {
delta = sample(c(-1, 1), size = T, replace = TRUE, prob = c(1 - p, p))
x = cumsum(delta)
return(x)
}
Next, we run the same function nsim times and record the
results. It is a lucky break that the default names given to columns are
X1, X2, … so we don’t have to rename them. We
do have to add the zero-th column \(X_0=0\) because, formally speaking, the
“random variable” \(X_0=0\) is a part
of the stochastic process. This needs to be done before other columns
are added to maintain the proper order of columns, which is important
when you want to plot trajectories.
simulate_walk = function(nsim, T, p = 0.5) {
return(
data.frame(
X0 = 0,
t(replicate(nsim, single_trajectory(T, p)))
))
}
walk = simulate_walk(nsim = 10000, T = 500)
Now that we have the data frame walk, we can explore it in
at least two qualitatively different ways:
Here we focus on individual random variables (columns) or pairs, triplets, etc. of random variables and study their (joint) distributions. For example, we can plot histograms of the random variables \(X_5, X_8, X_{30}\) or \(X_{500}\):
We can also use various (graphical or not) devices to understand joint distributions of pairs of random variables:
If we focus on what is going on in a given row of walk,
we are going to see a different cross-section of our stochastic process.
This way we are fixing the state of the world \(\omega\) (represented by a row of
walk), i.e., the particular realization of our process, and
varying the time parameter. A typical picture associated to a trajectory
of a random walk is the following:
You can try to combine the two approaches (if you must) and plot several trajectories on the same plot. While this produces pretty pictures (and has one or two genuine applications), it usually leads to a sensory overload. Note that the trajectories on the right are jittered a bit. That means that the positions of the points are randomly shifted by a small amount. This allows us to see features of the plot that would otherwise be hidden because of the overlap.
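As a rough illustration of the two views, here is a sketch that plots a histogram of a single column next to a single row. It is not the code used to produce the figures in these notes: it rebuilds a smaller data frame from scratch (using apply instead of simulate_walk, so that it can be run on its own), and the seed, the number of simulations and the choice of \(X_{30}\) are arbitrary.

```r
set.seed(3)        # arbitrary seed
T = 500            # time horizon
nsim = 1000        # number of simulated trajectories

# each row of `steps` holds the T coin tosses of one trajectory
steps = matrix(sample(c(-1, 1), size = nsim * T, replace = TRUE), nrow = nsim)
# partial sums along each row give X1, ..., XT; X0 = 0 is added in front
walk = data.frame(X0 = 0, t(apply(steps, 1, cumsum)))

par(mfrow = c(1, 2))
# column-wise view: the distribution of the single random variable X30
hist(walk$X30, breaks = seq(-31, 31, by = 2), probability = TRUE,
     main = "Histogram of X30", xlab = "X30")
# row-wise view: the first simulated trajectory, plotted against time
plot(0:T, unlist(walk[1, ]), type = "l",
     main = "One trajectory", xlab = "n", ylab = "X_n")
```

The breaks in the histogram are placed at odd numbers so that each possible (even) value of \(X_{30}\) gets its own bar.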
The row-wise (or path-wise or trajectory-wise) view of the random walk described above illustrates a very important point: the random walk (and random processes in general) can be seen as a random “variable” whose values are not merely numbers; they are sequences of numbers (trajectories). In other words, a random process is simply a “random trajectory”. We can simulate this random trajectory as we did above, by simulating the steps and adding them up, but we could also take a different approach. We could build the set of all possible trajectories, and then pick a random trajectory out of it.
For a random walk on a finite horizon \(T\), a trajectory is simply a sequence of integers starting from \(0\). Different realizations of the coin-tosses \(\delta_n\) will lead to different trajectories, but not every sequence of integers corresponds to a trajectory. For example \((0,3,4,5)\) is not possible because the increments of the random walk can only take values \(1\) or \(-1\). In fact, a finite sequence \((x_0, x_1, \dots, x_T)\) is a (possible) sample path of a random walk if and only if \(x_0=0\) and \(x_{k}-x_{k-1} \in \{-1,1\}\) for each \(k\). For example, when \(T=3\), there are \(8\) possible trajectories: \[ \begin{align} \Omega = \{ &(0,1,2,3), (0,1,2,1),(0,1,0,1), (0,1,0,-1), \\ & (0,-1,-2,-3), (0,-1,-2,-1), (0,-1,0,-1), (0,-1,0,1)\} \end{align}\] When you (mentally) picture them, think of their graphs:
Each trajectory corresponds to a particular combination of the values
of the increments \((\delta_1,\dots, \delta_T)\), and each such
combination happens with probability \(2^{-T}\). This means that any two
trajectories are equally likely. That is convenient, because this puts
uniform probability on the collection of trajectories. We are now ready
to implement our simulation procedure in R; let us write the function
single_trajectory using this approach and use it to
simulate a few trajectories. We assume that a function
all_paths(T) which returns a list of all possible paths
with horizon \(T\) has already been
implemented (more info about a possible implementation in R is given in
a problem below):
T=5
Omega = all_paths(T)
single_trajectory = function() {
return(unlist(sample(size=1,Omega)))
}
simulate_walk = function(nsim, p=0.5) {
return(data.frame(
X0=0,
t(replicate(nsim, single_trajectory()))
))
}
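The function all_paths is not defined in this section. A minimal sketch of one possible implementation is given below. It is a hypothetical helper, not necessarily the implementation intended in the problem mentioned above, and it assumes that each path is stored as the vector \((x_1,\dots,x_T)\), without the initial value \(0\), since simulate_walk adds the column X0 separately.

```r
# hypothetical implementation: a list of all 2^T paths (x_1, ..., x_T)
all_paths = function(T) {
  # every row of `steps` is one combination of increments (delta_1, ..., delta_T)
  steps = as.matrix(expand.grid(rep(list(c(-1, 1)), T)))
  # turn each row of increments into a path by taking partial sums
  lapply(seq_len(nrow(steps)), function(i) unname(cumsum(steps[i, ])))
}

Omega = all_paths(3)
length(Omega)      # number of paths with horizon 3
Omega[[1]]         # the path built from three down-steps
```

Since the size of the list doubles with each additional step, this approach is practical only for small horizons \(T\).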
Building a path space is not simply an exercise in abstraction. Here is how we can use it to understand the distribution of the position of the random walk:
Let \(X\) be a simple symmetric random walk with time horizon \(T=5\). What is the probability that \(X_{5}=1\)?
Let \(\Omega\) be the path space, i.e., the set of all possible trajectories of length \(5\) - there are \(2^{5}=32\) of them. The probability that \(X_{5}=1\) is the probability that a randomly picked path from \(\Omega\) will take the value \(1\) at \(n=5\). Since all paths are equally likely, we need to count the number of paths with value \(1\) at \(n=5\) and then divide by the total number of paths, i.e., \(32\).
So, how many paths are there that take value \(1\) at \(n=5\)? Each path is built out of steps of absolute value \(1\). Some of them go up (call them up-steps) and some of them go down (down-steps). A moment’s thought reveals that the only way to reach \(1\) in \(5\) steps is if you have exactly \(3\) up-steps and \(2\) down-steps. Conversely, any path that has \(3\) up-steps and \(2\) down-steps ends at \(1\).
This realization transforms the problem into the following: how many paths are there with exactly \(3\) up-steps (note that we don’t have to specify that there are \(2\) down-steps - it will happen automatically). The only difference between different paths with exactly \(3\) up-steps is the position of these up-steps. In some of them the up-steps happen right at the start, in some at the very end, and in some they are scattered around. Each path with \(3\) up-steps is uniquely determined by the list of positions of those up-steps, i.e., with a size-\(3\) subset of \(\{1,2,3,4,5\}\). This is not a surprise at all, since each path is built out of increments, and positions of positive increments clearly determine values of all increments.
The problem has now become purely mathematical: how many size-\(3\) subsets of \(\{1,2,3,4,5\}\) are there? The answer comes in the form of a binomial coefficient \(\binom{5}{3}\) whose value is \(10\) - there are exactly ten ways to pick three positions out of five. Therefore, \[ {\mathbb{P}}[ X_{5} = 1] = 10 \times 2^{-5} = \frac{5}{16}.\]
Can we do this in general?
Let \(X\) be a simple symmetric random walk with time horizon \(T\). What is the probability that \(X_{n}=k\)?
The reasoning from the last example still applies. A trajectory with \(u\) up-steps and \(d\) down-steps will end at \(u-d\), so we must have \(u-d=k\). On the other hand \(u+d=n\) since all steps that are not up-steps are necessarily down-steps. This gives us a simple linear system with two equations and two unknowns which solves to \(u = (n+k)/2\), \(d=(n-k)/2\). Note that \(n\) and \(k\) must have the same parity for this solution to be meaningful. Also, \(k\) must be between \(-n\) and \(n\).
Having figured out how many up-steps are necessary to reach \(k\), all we need to do is count the number of trajectories with that many up-steps. Like before, we can do that by counting the number of ways we can choose their position among \(n\) steps, and, like before, the answer is the binomial coefficient \(\binom{n}{u}\) where \(u=(n+k)/2\). Dividing by the total number of trajectories gives us the final answer: \[ {\mathbb{P}}[ X_n = k ] = \binom{n}{ (n+k)/2} 2^{-n},\] for all \(k\) between \(-n\) and \(n\) with the same parity as \(n\). For all other \(k\), the probability is \(0\).
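The formula translates directly into a few lines of R. The function walk_pmf below is a name introduced here for illustration; it is a sketch which assumes that n is a single positive integer, while k may be a vector of integers.

```r
# P[X_n = k] for the simple symmetric random walk, from the formula above
walk_pmf = function(n, k) {
  ok = (abs(k) <= n) & ((n + k) %% 2 == 0)   # k in range, same parity as n
  p = numeric(length(k))                     # probability 0 for all other k
  p[ok] = choose(n, (n + k[ok]) / 2) * 2^(-n)
  p
}

walk_pmf(5, 1)           # the example with T = 5 above: 10 / 32
walk_pmf(5, -5:5)        # the whole distribution of X_5
sum(walk_pmf(5, -5:5))   # probabilities add up to 1
```

The values with the wrong parity (here, the even values of \(k\)) come out as \(0\), as they should.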
The binomial coefficient and the \(n\)-th power suggest that the distribution of \(X_n\) might have something to do with the binomial distribution. It is clearly not the binomial, since it can take negative values, but it is related. To figure out what is going on, let us first remember what the binomial distribution is all about. Formally, it is a discrete distribution with two parameters \(n\in{\mathbb{N}}\) and \(p\in (0,1)\). Its support is \(\{0,1,2,\dots, n\}\) and the distribution is given by the following table, where \(q=1-p\)
| 0 | 1 | 2 | … | k | … | n |
|---|---|---|---|---|---|---|
| \(\binom{n}{0} q^n\) | \(\binom{n}{1} p q^{n-1}\) | \(\binom{n}{2} p^2 q^{n-2}\) | … | \(\binom{n}{k} p^k q^{n-k}\) | … | \(\binom{n}{n} p^n\) |
The binomial distribution is best understood, however, when it is expressed as a “number of successes”. More precisely,
If \(B_1,B_2,\dots, B_n\) are \(n\) independent Bernoulli random variables with the same parameter \(p\), then their sum \(B_1+\dots+B_n\) has a binomial distribution with parameters \(n\) and \(p\).
We think of \(B_1, \dots, B_n\) as indicator random variables of “successes” in \(n\) independent “experiments” each of which “succeeds” with probability \(p\). A canonical example is tossing a biased coin \(n\) times and counting the number of “heads”.
We know that the position \(X_n\) at time \(n\) of the random walk admits the representation \[ X_n = \delta_1+\delta_2+\dots+\delta_n,\] just like the binomial random variable. The distribution of \(\delta_k\) is not Bernoulli, though, since it takes the values \(-1\) and \(1\), and not \(0,1\). This is easily fixed by applying the linear transformation \(x\mapsto \frac{1}{2}(x+1)\); indeed \(( -1 +1)/2 = 0\) and \(( 1 + 1) / 2 =1\), and, so, \[ \frac{1}{2}(\delta_k+1)\text{ is a Bernoulli random variable with parameter } p=\frac{1}{2}.\] Consequently, if we add all \(B_k = \tfrac{1}{2}(1+\delta_k)\) and remember our discussion from above we get the following statement
In a simple symmetric random walk the random variable \(\frac{1}{2} (n + X_n)\) has the binomial distribution with parameters \(n\) and \(p=1/2\), for each \(n\).
Can you use that fact to rederive the distribution of \(X_n\)?
If the steps of the random walk preferred one direction to the other, the definition would need to be tweaked a little bit and the word “symmetric” in the name gets replaced by “biased” (or “asymmetric”):
A stochastic process \(\{X_n\}_{n\in{\mathbb{N}}_0}\) is said to be a simple biased random walk with parameter \(p\in (0,1)\) if
\(X_0=0\),
the random variables \(\delta_1 = X_1-X_0\), \(\delta_2 = X_2 - X_1\), …, called the steps of the random walk, are independent and
each \(\delta_n\) has a biased coin-toss distribution, i.e., its distribution is given by \[{\mathbb{P}}[ \delta_n = 1] = p \text{ and } {\mathbb{P}}[ \delta_n=-1] = 1-p \text{ for each } n.\]
As far as the distribution of \(X_n\) is concerned, we don’t expect it to be the same as in the symmetric case. After all, the biased random walk (think \(p=0.999\)) will prefer one direction over the other. Our trick with writing \(\frac{1}{2}(n+X_n)\) as a sum of Bernoulli random variables still works. We just have to remember that \(p\) is not \(\frac{1}{2}\) anymore to conclude that \(\tfrac{1}{2}(X_n + n)\) has the binomial distribution with parameters \(n\) and \(p\); if we put \(u = (n+k)/2\) we get \[\begin{align} {\mathbb{P}}[ X_n = k] &= {\mathbb{P}}[ \tfrac{1}{2}(X_n+n) = u] = \binom{n}{u} p^u q^{n-u}\\ & = \binom{n}{\frac{1}{2}(n+k)} p^{\frac{1}{2}(n+k)} q^{\frac{1}{2}(n-k)}. \end{align}\] Note that the binomial coefficient stays the same as in the symmetric case, but the factor \(2^{-n} = (1/2)^{\frac{1}{2}(n+k)} (1/2)^{\frac{1}{2}(n-k)}\) becomes \(p^{\frac{1}{2}(n+k)} q^{\frac{1}{2}(n-k)}\).
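Since \(\tfrac{1}{2}(X_n+n)\) is binomial, the built-in function `dbinom` can do the work for us. The following is a minimal sketch (the name `biased_rw_pmf` is an assumption made for illustration) that also compares the result with the explicit formula for one choice of \(n\), \(k\) and \(p\):

```r
biased_rw_pmf <- function(n, k, p) {
  # P[X_n = k] for a simple random walk with up-step probability p
  ok <- (abs(k) <= n) & ((n + k) %% 2 == 0)   # range and parity check
  out <- numeric(length(k))
  out[ok] <- dbinom((n + k[ok]) / 2, size = n, prob = p)
  out
}

n <- 10; k <- 4; p <- 0.3
u <- (n + k) / 2
c(biased_rw_pmf(n, k, p), choose(n, u) * p^u * (1 - p)^(n - u))
```

Setting `p = 0.5` recovers the symmetric formula \(\binom{n}{(n+k)/2} 2^{-n}\).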
Can we reuse the sample space \(\Omega\) to build a biased random walk? Yes, we can, but we need to assign possibly different probabilities to individual paths. Indeed, if \(p=0.99\), the probability that all the increments \(\delta\) of a \(10\)-step random walk take the value \(+1\) is \((0.99)^{10} \approx 0.90\). This is much larger than the probability that all steps take the value \(-1\), which is \((0.01)^{10}= 10^{-20}\).
In general, the probability that a particular path is picked out of \(\Omega\) will depend on the number of up-steps and down-steps; more precisely it equals \(p^u q^{n-u}\) where \(u\) is the number of up-steps. The interesting thing is that the number of up-steps \(u\) depends only on the final position \(x_n\) of the path; indeed \(u = \frac{1}{2}(n+x_n)\). This way, all paths of length \(T=5\) that end up at \(1\) get the same probability of being chosen, namely \(p^3 q^2\). Let us use the awful seizure-inducing graph with multiple paths for good, and adjust the shade of each path according to its probability; some jitter has been added to deal with overlap. The lighter-colored paths are less likely to happen than the darker-colored paths.
Let \(\{X_n\}_{n\in {\mathbb{N}}_0}\) be a simple symmetric random walk. Which of the following processes are simple random walks?
\(\{2 X_n\}_{n\in {\mathbb{N}}_0}\) ?
\(\{X^2_n\}_{n\in {\mathbb{N}}_0}\) ?
\(\{-X_n\}_{n\in {\mathbb{N}}_0}\) ?
\(\{ Y_n\}_{n\in {\mathbb{N}}_0}\), where \(Y_n = X_{5+n}-X_5\) ?
How about the case \(p\ne \tfrac{1}{2}\)?
No - the support of the distribution of \(X_1\) is \(\{-2,2\}\) and not \(\{-1,1\}\).
No - \(X_1^2=1\), and not \(\pm 1\) with equal probabilities.
Yes - all parts of the definition check out.
Yes - all parts of the definition check out.
The answers are the same if \(p\ne \tfrac{1}{2}\), but, in 3., \(-X_n\) comes with probability \(1-p\) of an up-step, and not \(p\).
Let \(\{X_n\}_{n\in {\mathbb{N}}_0}\) be a simple random walk.
Find the distribution of the product \(X_1 X_2\)
Compute \({\mathbb{P}}[ |X_1 X_2 X_3|=2]\)
Find the probability that \(X\) will hit neither the level \(2\) nor the level \(-2\) until (and including) time \(T=3\)
Find the independent pairs of random variables among the following choices:
| 0 | 2 |
|---|---|
| 0.5 | 0.5 |
\(|X_1 X_2 X_3|=2\) only in the following two cases \[ X_1=1, X_2=2, X_3=1 \text{ or } X_1=-1, X_2=-2, X_3=-1.\] Each of those paths has probability \(1/8\) of happening, so \({\mathbb{P}}[ |X_1 X_2 X_3| = 2] = 1/4\).
The only chance for \(X\) to hit \(2\) or \(-2\) before or at \(T = 3\) is at time \(n = 2\). Since \(X_2 \in \{ -2, 0, 2\}\), this happens with probability \({\mathbb{P}}[ X_2 \in \{-2,2\}] = 1 - {\mathbb{P}}[X_2 = 0] = 0.5\).
The only independent pair is \(X_4 - X_2\) and \(X_6 - X_5\) because the two random variables are built out of completely different increments: \(X_4 - X_2 = \delta_3+\delta_4\) while \(X_6-X_5 = \delta_6\). The others are not independent. For example, if we are told that \(X_1+X_3 = 4\), it necessarily follows that \(\delta_1= \delta_2=\delta_3=1\). Hence, \(X_2+X_4 = 2\delta_1+2\delta_2+\delta_3+\delta_4 = 5+\delta_4\) which cannot be less than \(4\). On the other hand, without any information, \(X_2+X_4\) can easily be negative.
Let \(\{X_n\}_{n\in {\mathbb{N}}_0}\) be a simple random walk.
Compute \({\mathbb{P}}[ X_{32} = 4| X_8 = 6]\).
Compute \({\mathbb{P}}[ X_9 = 3 \text{ and } X_{15}=5 ]\)
(extra credit) Compute \({\mathbb{P}}[ X_7 + X_{12} = X_1 + X_{16}]\)
This is the same as \({\mathbb{P}}[ X_{32}- X_8 = -2 | X_8=6]\). The random variables \(X_8\) and \(X_{32}-X_8\) are independent (as they are built out of different \(\delta\)s), so we can remove the conditioning. It remains to compute \({\mathbb{P}}[X_{32} - X_8 = -2]\). For that, we note that \(X_{32} - X_8\) is a sum of \(24\) independent coin tosses, so its distribution is the same as that of \(X_{24}\). Therefore, by our formula for the distribution of \(X_n\), we have \[ {\mathbb{P}}[X_{32}= 4 | X_8 = 6] = {\mathbb{P}}[X_{24} = -2] = \binom{24}{11} 2^{-24}.\]
We have \[\begin{align} {\mathbb{P}}[ X_9 = 3 \text{ and } X_{15}=5 ] & = {\mathbb{P}}[ X_{15} = 5 | X_9 = 3] \times {\mathbb{P}}[ X_9=3] \\ & = {\mathbb{P}}[ X_6 = 2] \times {\mathbb{P}}[X_9=3] = \binom{6}{4} 2^{-6} \binom{9}{6} 2^{-9}, \end{align}\] where we used the same ideas as in 1. above
We rewrite everything using \(\delta\)s: \[\begin{align} X_7+X_{12} = X_1+X_{16} &\Leftrightarrow X_7-X_1 = X_{16}-X_{12} \Leftrightarrow \delta_2+\dots+\delta_7 = \delta_{13} + \dots+\delta_{16}\\ & \Leftrightarrow (-\delta_2) + \dots + (-\delta_7) + \delta_{13}+ \dots + \delta_{16} = 0. \end{align}\] Since \(-\delta_k\) has the same distribution as \(\delta_k\) (both are coin tosses) and remains independent of all other \(\delta_i\), the left-hand side of the last expression in the chain of equivalences above is a sum of \(10\) independent coin tosses. Therefore, the probability that it equals \(0\) is the same as \({\mathbb{P}}[X_{10}=0] = \binom{10}{5} 2^{-10}\).
Let \(\{X_n\}_{n\in {\mathbb{N}}_0}\) be a simple random walk. For \(n\in{\mathbb{N}}\) compute the probability that \(X_{2n}\), \(X_{4n}\) and \(X_{6n}\) take the same value.
Increments \(X_{4n}-X_{2n}\) and \(X_{6n} - X_{4n}\) are independent, and each is a sum of \(2n\) independent coin tosses (therefore has the same distribution as \(X_{2n}\)). Hence, \[\begin{aligned} {\mathbb{P}}[ X_{2n} = X_{4n} \text{ and }X_{4n} = X_{6n} ] &= {\mathbb{P}}[ X_{4n} - X_{2n} = 0 \text{ and }X_{6n} - X_{4n} = 0 ]\\ &= {\mathbb{P}}[ X_{4n} - X_{2n} = 0] \times {\mathbb{P}}[ X_{6n} - X_{4n}=0]\\ &={\mathbb{P}}[ X_{2n}=0] \times {\mathbb{P}}[ X_{2n} =0 ]\\ & = \binom{2n}{n} 2^{-2n} \binom{2n}{n} 2^{-2n} = \binom{2n}{n}^2 2^{-4n}. \end{aligned}\]
Write an R function (call it all_paths) which takes an
integer argument T and returns a list of all possible paths
of a random walk with time horizon \(T\). (Note: Since vectors cannot have other
vectors as elements, you will need to use a data structure called
list for this. It behaves very much like a vector, so it
should not be a problem.)
The implementation below uses the function combn which
returns the list of all subsets of a certain size of a certain vector.
Since each path is determined by the positions of its up-steps, we need
to loop through all numbers \(i\) from
\(0\) to \(T\) and then list all subsets of the size
\(i\). The next step is to turn a set
of positions to a path of a random walk. This can be done in many ways;
one is implemented in choice_to_path using
vector indexing.
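choice_to_path = function(comb, T) {
  # start with all down-steps, then turn the chosen positions into up-steps
  increments = rep(-1, T)
  increments[comb] = 1
  # the path is the running sum of the increments
  path = cumsum(increments)
  return(path)
}
all_paths = function(T) {
  # preallocate a list with one slot for each of the 2^T paths
  Omega = vector("list", 2^T)
  index = 1
  # i is the number of up-steps
  for (i in 0:T) {
    # all size-i subsets of {1, ..., T}, returned as a list of vectors
    choices = combn(T, i, simplify = FALSE)
    for (choice in choices) {
      Omega[[index]] = choice_to_path(choice, T)
      index = index + 1
    }
  }
  return(Omega)
}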
Let \(\{X_n\}_{n\in {\mathbb{N}}_0}\) be a simple symmetric random walk. Given \(n\in{\mathbb{N}}_0\) and \(k\in{\mathbb{N}}\), compute \(\operatorname{Var}[X_n]\), \(\operatorname{Cov}[X_n, X_{n+k}]\) and \(\operatorname{corr}[X_n, X_{n+k}]\), where \(\operatorname{Cov}\) stands for the covariance and \(\operatorname{corr}\) for the correlation. (Note: look up \(\operatorname{Cov}\) and \(\operatorname{corr}\) if you forgot what they are).
Compute \(\lim_{n\to\infty} \operatorname{corr}[X_n, X_{n+k}]\) and \(\lim_{k\to\infty} \operatorname{corr}[X_n, X_{n+k}]\). How would you interpret the results you obtained?
We have \(\operatorname{Var}[\delta_i] = 1\) for each \(i\in{\mathbb{N}}\), so \[\operatorname{Var}[X_n] = \sum_{i=1}^n \operatorname{Var}[\delta_i] = n.\] Since \({\mathbb{E}}[X_n] = {\mathbb{E}}[X_{n+k}]=0\) and \(X_{n+k} - X_n\) is independent of \(X_n\), we have \[\begin{aligned} \operatorname{Cov}[X_n,X_{n+k}] &= {\mathbb{E}}[ X_n X_{n+k}] = {\mathbb{E}}[ X_n (X_{n+k} - X_n)] + {\mathbb{E}}[X_n^2] = {\mathbb{E}}[X_n] {\mathbb{E}}[X_{n+k} - X_n] + {\mathbb{E}}[X_n^2]\\ &= {\mathbb{E}}[X_n^2] = n. \end{aligned}\] Finally, \[\begin{aligned} \operatorname{corr}[X_n, X_{n+k}] = \frac{\operatorname{Cov}[X_n, X_{n+k}]}{\sqrt{\operatorname{Var}[X_n]} \sqrt{\operatorname{Var}[X_{n+k}]}} = \frac{n}{\sqrt{n(n+k)}} = \sqrt{\frac{n}{n+k}}. \end{aligned}\] When we let \(n\to\infty\), we get \(1\). This means that the positions of the random walk, \(k\) steps apart, get closer and closer to perfect correlation as \(n\to\infty\). If you know \(X_n\) and \(n\) is large, you almost know \(X_{n+k}\), at least at the typical scale of \(X_n\).
When we let \(k\to\infty\), we get \(0\). That means that as the gap between two points in time gets larger, the values get less and less correlated. In a sense, the random walk tends to forget its past after a large number of steps.
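A quick simulation can be used to sanity-check the formula \(\operatorname{corr}[X_n, X_{n+k}] = \sqrt{n/(n+k)}\). The sketch below does not rely on simulate_walk; it builds the steps directly with `sample`, and the values of `nsim`, `n` and `k` are arbitrary choices made for illustration:

```r
set.seed(1)
nsim <- 20000; n <- 20; k <- 60
# each row holds the n + k coin tosses of one simulated walk
steps <- matrix(sample(c(-1, 1), nsim * (n + k), replace = TRUE), nrow = nsim)
X_n  <- rowSums(steps[, 1:n])   # position after n steps
X_nk <- rowSums(steps)          # position after n + k steps
c(simulated = cor(X_n, X_nk), exact = sqrt(n / (n + k)))
```

With these values the exact correlation is \(\sqrt{20/80} = 0.5\), and the simulated value should land close to it.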
Let \(\{X_n\}_{n\in {\mathbb{N}}_0}\) be a simple random walk with \({\mathbb{P}}[X_1=1]=p\in (0,1)\), and let \(A_n\) be the (signed) area under its graph (in the picture below, \(A_n\) is the area of the blue part minus the area of the orange part).
Find a formula for \(A_n\) in terms of \(X_1,\dots, X_n\).
Compute \({\mathbb{E}}[A_n]\) and \(\operatorname{Var}[A_n]\), for \(n\in{\mathbb{N}}\). (You will find the following formulas helpful \(\sum_{j=1}^n j = \frac{n(n+1)}{2}\) and \(\sum_{j=1}^n j^2=\frac{n(n+1)(2n+1)}{6}\).)
Use simulations to approximate the entire distribution of \(A_n\) (set \(p=0.3\), \(n=6\) and run \(10000\) simulations): display the entire distribution table. Next, compute Monte-Carlo approximations to \({\mathbb{E}}[A_{6}]\) and \(\operatorname{Var}[A_{6}]\) and compare them to the exact values obtained in 2. above.
The dashed lines divide the area “under” the graph in separate trapezoids, so \(A_n\) is the sum of their areas. The trapezoid between \(X_{k-1}\) and \(X_{k}\) has area \(1 \times (X_{k-1}+X_{k})/2\), so \[ A_n = \sum_{k=1}^n \tfrac{1}{2} (X_{k-1}+X_k) = X_1+X_2+\dots+X_{n-1} + \tfrac{1}{2}X_n.\]
Let us first represent \(A_n\) in terms of the sequence \(\{\delta_n\}_{n\in{\mathbb{N}}}\): \[\begin{align} A_n &= (\delta_1) + (\delta_1+\delta_2) + \dots + (\delta_1+\dots + \delta_{n-1}) + \tfrac{1}{2}(\delta_1+ \dots + \delta_n)\\ &= (n-\tfrac{1}{2}) \delta_1 + (n-1-\tfrac{1}{2}) \delta_2 + \dots + \tfrac{1}{2}\delta_n. \end{align}\] In other words, the coefficient \(j-\tfrac{1}{2}\) multiplies \(\delta_{n+1-j}\), for \(j=1,\dots,n\). We compute \({\mathbb{E}}[\delta_k]=p-q\) so that, by the formulas from the problem, \[\begin{align} {\mathbb{E}}[A_n]&= \sum_{j=1}^n (j-\tfrac{1}{2}) {\mathbb{E}}[\delta_{n+1-j}] = (p-q) \Big( \tfrac{1}{2}n(n+1) - \tfrac{1}{2}n\Big)\\ & = \frac{p-q}{2} n^2. \end{align}\] Just like above, but relying on the independence of \(\{\delta_n\}\) and the fact that \(\operatorname{Var}[\delta_k]=1-(2p-1)^2=4pq\), we have \[\begin{align} \operatorname{Var}[A_n] &= \sum_{j=1}^n \operatorname{Var}[(j - \tfrac{1}{2}) \delta_{n+1-j}] = \sum_{j=1}^n (j-\tfrac{1}{2})^2 \operatorname{Var}[\delta_{n+1-j}] \\& = 4pq \sum_{j=1}^n (j-\tfrac{1}{2})^2 = 4pq \Big( \sum_{j=1}^n j^2 - \sum_{j=1}^n j + \frac{1}{4} n \Big)\\ & = 4pq \Big( \frac{n(n+1)(2n+1)}{6} - \frac{n (n+1)}{2} + \frac{n}{4}\Big) = \frac{pq}{3} ( 4 n^3 - n). \end{align}\]
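For small \(n\), the two formulas can be double-checked without any simulation by listing all \(2^n\) paths together with their probabilities \(p^u q^{n-u}\). The function `area_moments` below is a hypothetical helper written for this check only (it assumes \(n \geq 2\)); it computes \(A_n\) directly from the positions \(X_1, \dots, X_n\) rather than from the formulas being tested:

```r
area_moments <- function(n, p) {
  q <- 1 - p
  steps <- as.matrix(expand.grid(rep(list(c(-1, 1)), n)))  # all 2^n step sequences
  u <- rowSums(steps == 1)               # number of up-steps in each path
  w <- p^u * q^(n - u)                   # probability of each path
  X <- t(apply(steps, 1, cumsum))        # rows are (X_1, ..., X_n)
  A <- rowSums(X) - 0.5 * X[, n]         # A_n = X_1 + ... + X_{n-1} + X_n / 2
  m <- sum(w * A)                        # exact expectation
  c(mean = m, var = sum(w * (A - m)^2))  # exact variance
}

area_moments(6, 0.3)
c(mean = (0.3 - 0.7) / 2 * 6^2, var = 0.3 * 0.7 / 3 * (4 * 6^3 - 6))
```

The two printed vectors should coincide up to floating-point error.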
First, we draw \(10000\)
simulations of a simple random walk. For that we use the function
simulate_walk from the notes above:
walk = simulate_walk(nsim = 10000, T = 6, p = 0.3)
Next, we use the formula \(A_{6} = X_1+
\dots+X_5 + \tfrac{1}{2}X_{6}\) from 1. to create the vector
A6, which will hold \(10000\) simulations of \(A_{6}\). We use the function
apply to apply the function sum to each row of
walk (that is what the MARGIN = 1 argument
means; setting MARGIN = 2 would produce the vector of
column sums):
A6 = apply(walk, MARGIN = 1, FUN = sum) - 0.5 * walk$X6
Next, we create the table of relative frequencies of occurrences of
each value in A6. This table will serve as an approximation
to the true distribution of \(A_{6}\):
(dist_est = prop.table(table(A6)))
## A6
## -18 -17 -15 -14 -13 -12 -11 -10 -9 -8 -7
## 0.1223 0.0503 0.0510 0.0199 0.0529 0.0233 0.0515 0.0415 0.0608 0.0429 0.0540
## -6 -5 -4 -3 -2 -1 0 1 2 3 4
## 0.0675 0.0169 0.0407 0.0293 0.0468 0.0269 0.0275 0.0278 0.0317 0.0278 0.0076
## 5 6 7 8 9 10 11 12 13 14 15
## 0.0169 0.0103 0.0112 0.0073 0.0112 0.0087 0.0015 0.0036 0.0019 0.0034 0.0012
## 17 18
## 0.0014 0.0005
Using the formulas derived in 2. above, we get the following exact values \[\begin{align} {\mathbb{E}}[A_6]&= \frac{0.3 - 0.7}{2} \times 6^2 = - 7.2 \\ \operatorname{Var}[A_6] &= \frac{0.3 \times 0.7}{3} ( 4 \times 6^3 - 6) = 60.06, \end{align}\]
and Monte Carlo gives us the following estimates and their relative errors:
(expectation_true = -7.2)
## [1] -7.2
(expectation_est = mean(A6))
## [1] -7.3
(variance_true = 60.06)
## [1] 60
(variance_est = sd(A6)^2)
## [1] 60
(rel_error_expectation = (expectation_true - expectation_est)/expectation_true)
## [1] -0.017
(rel_error_variance = (variance_true - variance_est)/variance_true)
## [1] 0.0079
Let \(\{X_n\}_{0 \leq n \leq 100}\) be the simple symmetric random walk with time horizon \(T = 100\). We define the following random variables
Draw nsim simulations of \(L\), \(P\)
and \(R\) and display their histograms
on the same graph, side by side. Set the number of bins (option
breaks in the command hist) to \(50\). Choose the number nsim
so that the simulations take no more than 2 minutes, but do not go over
\(100,000\).
(Hint: For each of the three random variables write a
function which takes a single trajectory of a random walk as an argument
and returns its value for that trajectory. \(P\) is easy, and for \(L\) and \(R\) use the function match. Be
careful - match finds the first match, but you need the
last one. Then simulate the random walk and apply your
functions to the rows of your output data frame.)
What do you observe? What conjecture can you make about distributions of \(L\), \(P\) and \(R\)?
(Extra credit) Even though \(L\), \(P\) and \(R\) are discrete random variables, it turns out that distributions of their normalized versions \(L/T\), \(P/T\) and \(R/T\) are close to named continuous distributions. Guess what these distributions are and explain (graphically) why you think your guess is correct.
First, we simulate nsim=100,000 trajectories of
length T=100 of a simple symmetric random walk. We reuse
the function simulate_walk from the notes.
T = 100
nsim = 100000
df = simulate_walk(nsim, T)
Next, we write three functions. The input to each will be a
trajectory of a random walk, and the output will be the value of the
corresponding random variable (\(P\),
\(L\) or \(R\)) for that particular trajectory. The
function rev reverses its input.
compute_P = function(x) {
p = sum(x > 0)
return(p)
}
compute_L = function(x) {
L = length(rev(x)) - match(0, rev(x)) + 1
return(L)
}
compute_R = function(x) {
R = length(rev(x)) - match(max(x), rev(x)) + 1
return(R)
}
Then we apply each of the functions to each row of df
(the data frame that holds simulated trajectories of the walk)
R = apply(df, 1, compute_R)
L = apply(df, 1, compute_L)
P = apply(df, 1, compute_P)
and plot the histograms (ylim sets the visible range of
the \(y\) axis. We make all three the
same in order to be able to compare the histograms better)
par(mfrow = c(1, 3))
hist(R, breaks = 50, probability = TRUE, ylim = c(0, 0.04))
hist(L, breaks = 50, probability = TRUE, ylim = c(0, 0.04))
hist(P, breaks = 50, probability = TRUE, ylim = c(0, 0.04))
All three histograms look the same. Perhaps the random variables \(P\), \(L\) and \(R\) have the same distributions? That is quite strange, though, because they are constructed by very different procedures.
When we normalize the random variables \(P\), \(L\) and \(R\) by \(T=100\), i.e., consider \(P/100\), \(L/100\) and \(R/100\), we obtain almost the same histograms. The only difference is that the \(x\)-axis ranges from \(0\) to \(1\) and not from \(0\) to \(T=100\). This suggests checking whether any of the named distributions with support on \([0,1]\) fits. Fortunately, there is only one non-esoteric family of distributions on \([0,1]\) and that is the beta family. If you fiddle around a bit with the two parameters you will quickly find that setting both of them to \(1/2\) fits our histograms very well (I am plotting only the histogram of \(R/T\); the others look very similar)
hist(R/100, breaks = 50, probability = TRUE, main = "Histogram of R/100 with the pdf of B(1/2, 1/2) superimposed")
curve(dbeta(x, shape1 = 0.5, shape2 = 0.5), from = 0, to = 1, add = TRUE, col = 2,
lwd = 2)
That is, indeed, what is going on. As the number of steps \(T\) gets larger, the distributions of \(L/T\), \(P/T\) and \(R/T\) all converge towards the same, beta,
distribution with parameters \(\alpha=1/2\) and \(\beta=1/2\). The exact meaning of the word
“converge” in the previous sentence, or the proof of this statement are
beyond the scope of this course, but you cannot argue with the fact that
the fit looks really good on the picture above.
Btw, the pdf of the \(B(1/2,1/2)\) distribution is given by \[\begin{align} f(x) = \frac{1}{\pi \sqrt{x(1-x)}} \text{ for } x \in [0,1]. \end{align}\] The cdf \(F\) of \(B(1/2,1/2)\) is therefore given by \[\begin{align} F(x) = \int_0^x \frac{1}{\pi\sqrt{y(1-y)}}\, dy = \frac{2}{\pi} \arcsin(\sqrt{x}) \text{ for } x\in [0,1]. \end{align}\] This is why the \(B(1/2,1/2)\)-distribution is sometimes called the arcsine-distribution and the mathematical theorem that states that \(P/T\), \(L/T\) and \(R/T\) all approach the \(B(1/2,1/2)\)-distribution as \(T\to\infty\) is called the arcsine law.
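Both closed-form expressions are easy to compare with R's built-in beta functions `dbeta` and `pbeta`. The grid `x` below is an arbitrary choice; it stays away from the endpoints \(0\) and \(1\), where the density blows up:

```r
x <- seq(0.05, 0.95, by = 0.05)
# pdf: dbeta versus 1 / (pi * sqrt(x (1 - x)))
pdf_gap <- max(abs(dbeta(x, 0.5, 0.5) - 1 / (pi * sqrt(x * (1 - x)))))
# cdf: pbeta versus (2 / pi) * arcsin(sqrt(x))
cdf_gap <- max(abs(pbeta(x, 0.5, 0.5) - (2 / pi) * asin(sqrt(x))))
c(pdf_gap, cdf_gap)
```

Both gaps should be zero up to numerical error.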
If you are interested in sports modeling, have a look at the following article where the arcsine law appears in an unusual context.
Counting trajectories in order to compute probabilities is a powerful method, as our next example shows. It also reveals a potential weakness of the combinatorial approach: it works best when all \(\omega\) are equally likely (e.g., when \(p=\tfrac{1}{2}\) in the case of the random walk).
We start by asking a simple question: what is the typical record value of the random walk, i.e., how far “up” (or “right” depending on your point of view) does it typically get? Clearly, the largest value it can attain is \(T\). This happens only when all coin tosses came up \(+1\), an extremely unlikely event - its probability is \(2^{-T}\). On the other hand, this maximal value is at least \(0\), since \(X_0=0\), already. A bit of thought reveals that any value between those two extremes is possible, but it is not at all easy to compute their probabilities.
More precisely, if \(\{X_n\}\) is a
simple random walk with time horizon \(T\), we define its running-maximum
process \(\{M_n\}_{n\in
{\mathbb{N}}_0}\) by \[M_n=\max(X_0,\dots, X_n),\ \text{ for }0 \leq n
\leq T,\] and ask what the probabilities \({\mathbb{P}}[M_n = k]\) for \(k=0,\dots, n\) are. An easy numerical
solution to this problem can be given by simulation. We reuse the
function simulate_walk defined at the beginning of the
chapter, but also employ a new function, called apply which
“applies” a function to each row (or column) of a data frame or a
matrix. It seems to be tailor-made for our purpose because we want to
compute the maximum of each row of the simulation matrix (remember - the
row means keep the realization fixed, but vary the time-index \(n\)). The syntax of apply is
simple - it needs the data frame, the margin (rows are coded as 1 and
columns as 2; so when the margin is 1, the function is applied row-wise
and when the margin is 2, the function is applied column-wise) and the
function to be applied (max in our case). The output is a
vector of size nsim with all row-wise maxima:
walk = simulate_walk(nsim = 100000, T = 12, p = 0.5)
M = apply(walk, 1, max)
hist(M, breaks = seq(-0.5, 12.5, 1), probability = TRUE)
The overall shape of the distribution is as we expected; the support is \(\{0,1,2,\dots, 12\}\) and the probabilities tend to decrease as \(k\) gets larger. The unexpected feature is that \({\mathbb{P}}[ M_{12} = 1]\) seems to be the same as \({\mathbb{P}}[ M_{12} = 2]\). It drops after that for \(k=3\), but it looks like \({\mathbb{P}}[ M_{12} = 3] = {\mathbb{P}}[ M_{12}=4]\) again. Somehow the probability does not seem to change at all from \(2i-1\) to \(2i\).
Fortunately, there is an explicit formula for the distribution of \(M_n\) and we can derive it by a nice counting trick known as the reflection principle.
As usual, we may assume without loss of generality that \(T=n\) since the values of \(\delta_{n+1}, \dots, \delta_T\) do not affect \(M_n\) at all. We start by picking a level \(l\in\{1,\dots, n\}\) and first compute the probability \({\mathbb{P}}[M_n\geq l]\) - it will turn out to be easier than attacking \({\mathbb{P}}[ M_n=l]\) directly. The symmetry assumption \(p=1/2\) ensures that all trajectories are equally likely, so we can do this by counting the number of trajectories whose maximal level reached is at least \(l\), and then multiply by \(2^{-n}\).
What makes the computation of \({\mathbb{P}}[M_n \geq l]\) a bit easier than that of \({\mathbb{P}}[ M_n = l]\) is the following equivalence
\[M_n\geq l \text{ if and only if } X_k=l \text{ for some } k.\]
In words, the set of trajectories whose maximum is at least \(l\) is exactly the same as the set of trajectories that hit the level \(l\) at some time. Let us denote the set of trajectories \(\omega\) with this property by \(A_l\), so that \({\mathbb{P}}[ M_n \geq l] = {\mathbb{P}}[A_l]\). We can further split \(A_l\) into three disjoint events \(A_l^{>}\), \(A_l^{=}\) and \(A_l^{<}\), depending on whether \(X_n>l\), \(X_n=l\) or \(X_n<l\). In the picture below, the red trajectory is in \(A_l^{>}\), the green trajectory in \(A_l^=\) and the orange one in \(A_l^{<}\), while the blue one is not in \(A_l\) at all.
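With the set of all trajectories \(\Omega\) partitioned into four disjoint classes, namely \(A^>_l, A^=_l, A^<_l\) and \((A_l)^c\), we are ready to reveal the main idea behind the reflection principle:
The sets \(A_l^{>}\) and \(A_l^{<}\) have exactly the same number of elements, i.e., \(\# A_l^{>} = \# A_l^{<}\).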
To see why that is true, start by choosing a trajectory \(\omega\in A_l^{>}\) and denoting by \(\tau_l(\omega)\) the first time \(\omega\) visits the level \(l\). Since \(\omega \in A_l^{>}\), such a time clearly exists. Then we associate to \(\omega\) another trajectory, call it \(\bar{\omega}\), obtained from \(\omega\) in the following way: \(\bar{\omega}\) coincides with \(\omega\) up to time \(\tau_l(\omega)\), and is the mirror image of \(\omega\) around the level \(l\) after that.
Equivalently, the increments of \(\omega\) and \(\bar{\omega}\) are exactly the same up to time \(\tau_l(\omega)\), and exactly the opposite afterwards. In the picture below, the orange trajectory is \(\omega\) and the green trajectory is its “reflection” \(\bar{\omega}\); note that they overlap until time \(5\):
Convince yourself that this procedure establishes a bijection between the sets \(A_l^{>}\) and \(A_l^{<}\), making these two sets equal in size.
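If it helps, here is how the reflection could be carried out in R. The function `reflect_path` and the sample trajectory `omega` are made up for this sketch; the trajectory is stored as the vector \((X_0, X_1, \dots, X_n)\), and mirroring around the level \(l\) amounts to replacing a value \(x\) by \(2l - x\):

```r
reflect_path <- function(x, l) {
  # x = (X_0, X_1, ..., X_n); l = the level to reflect around
  tau <- match(l, x)             # index of the first visit to l (NA if l is never hit)
  if (is.na(tau)) return(x)      # paths that never hit l are left alone
  idx <- seq_along(x) > tau      # positions strictly after the first visit
  x[idx] <- 2 * l - x[idx]       # mirror image around the level l
  x
}

omega <- c(0, 1, 2, 1, 2, 3, 4, 3)     # hits l = 2 and ends above it
omega_bar <- reflect_path(omega, l = 2)
omega_bar                               # ends below l = 2, but still hits it
reflect_path(omega_bar, l = 2)          # reflecting twice gives omega back
```

The fact that reflecting twice returns the original trajectory is exactly why the map is a bijection between \(A_l^{>}\) and \(A_l^{<}\).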
So why is it important to know that \(\#
A_l^> = \# A_l^<\)? Because the trajectories in \(A_l^>\) (as well as in \(A_l^=\)) are easy to count. For them, the
requirement that the level \(l\) is hit
at a certain point is redundant; if you are at or above \(l\) at the very end, you must have hit
\(l\) at a certain point.
Therefore, \(A_l^{>}\) is simply the
family of those trajectories \(\omega\)
whose final positions \(X_n(\omega)\)
are somewhere strictly above \(l\).
Hence, \[\begin{align}
{\mathbb{P}}[A_l^{>}] &= {\mathbb{P}}[ X_n=l+1 \text{ or }
X_n = l+2 \text{ or } \dots \text{ or }
X_n=n]\\ & = \sum_{k=l+1}^n {\mathbb{P}}[X_n = k]
\end{align}\]
Similarly, \[\begin{aligned} {\mathbb{P}}[ A_l^{=}] = {\mathbb{P}}[X_n=l].\end{aligned}\] Finally, by the reflection principle, \[\begin{aligned} {\mathbb{P}}[ A_l^{<}] = {\mathbb{P}}[A_l^{>}] = \sum_{k=l+1}^n {\mathbb{P}}[X_n=k].\end{aligned}\]
Putting all of this together, we get \[\begin{aligned} {\mathbb{P}}[ A_l ] = {\mathbb{P}}[ X_n=l] + 2 \sum_{k=l+1}^n {\mathbb{P}}[X_n=k],\end{aligned}\] so that \[\begin{aligned} {\mathbb{P}}[ M_n = l ] &= {\mathbb{P}}[ M_n \geq l] - {\mathbb{P}}[ M_n \geq l+1]\\ & = {\mathbb{P}} [A_l] - {\mathbb{P}} [A_{l+1}]\\ & = {\mathbb{P}}[ X_n = l] + 2 {\mathbb{P}}[X_n = l+1] + 2{\mathbb{P}}[X_n = l+2]+ \dots + 2{\mathbb{P}}[ X_n=n] -\\ & \qquad \qquad \quad \ - {\mathbb{P}}[ X_n = l+1] - 2 {\mathbb{P}}[X_n = l+2] - \dots - 2{\mathbb{P}}[ X_n=n]\\ &= {\mathbb{P}}[ X_n=l] + {\mathbb{P}}[X_n=l+1] \end{aligned}\]
Now that we have the explicit expression \[ {\mathbb{P}}[ M_n = l ] = {\mathbb{P}}[ X_n=l] + {\mathbb{P}}[X_n = l+1] \text{ for } l=0,1,\dots, n,\] we can shed some light on the shape of the histogram for \(M_n\) we plotted above. Since \({\mathbb{P}}[X_n=l]\) is \(0\) if \(n\) and \(l\) don’t have the same parity, it is clear that only one of the probabilities \({\mathbb{P}}[X_n=l]\) and \({\mathbb{P}}[X_n=l+1]\) can be positive. It follows that, for \(n\) even, we have \[\begin{align} {\mathbb{P}}[ M_n =0] &= {\mathbb{P}}[X_n=0] + {\mathbb{P}}[X_n=1] = {\mathbb{P}}[X_n=0]\\ {\mathbb{P}}[M_n=1] &= {\mathbb{P}}[ X_n=1] + {\mathbb{P}}[X_n=2] = {\mathbb{P}}[X_n=2]\\ {\mathbb{P}}[M_n=2] &= {\mathbb{P}}[ X_n=2] + {\mathbb{P}}[X_n=3] = {\mathbb{P}}[X_n=2]\\ {\mathbb{P}}[M_n=3] &= {\mathbb{P}}[ X_n=3] + {\mathbb{P}}[X_n=4] = {\mathbb{P}}[X_n=4]\\ {\mathbb{P}}[M_n=4] &= {\mathbb{P}}[ X_n=4] + {\mathbb{P}}[X_n=5] = {\mathbb{P}}[X_n=4] \text{ etc.} \end{align}\] In a similar way, for \(n\) odd, we have \[\begin{align} {\mathbb{P}}[ M_n =0] &= {\mathbb{P}}[X_n=0] + {\mathbb{P}}[X_n=1] = {\mathbb{P}}[X_n=1]\\ {\mathbb{P}}[M_n=1] &= {\mathbb{P}}[ X_n=1] + {\mathbb{P}}[X_n=2] = {\mathbb{P}}[X_n=1]\\ {\mathbb{P}}[M_n=2] &= {\mathbb{P}}[ X_n=2] + {\mathbb{P}}[X_n=3] = {\mathbb{P}}[X_n=3]\\ {\mathbb{P}}[M_n=3] &= {\mathbb{P}}[ X_n=3] + {\mathbb{P}}[X_n=4] = {\mathbb{P}}[X_n=3]\\ {\mathbb{P}}[M_n=4] &= {\mathbb{P}}[ X_n=4] + {\mathbb{P}}[X_n=5] = {\mathbb{P}}[X_n=5] \text{ etc.} \end{align}\]
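For \(n=12\) there are only \(4096\) trajectories, so the formula for \({\mathbb{P}}[M_n=l]\) can be compared with an exact count. In the sketch below, `rw_pmf` is a hypothetical helper implementing \({\mathbb{P}}[X_n=k]=\binom{n}{(n+k)/2}2^{-n}\), and `brute` holds the distribution of \(M_{12}\) obtained by listing all paths:

```r
n <- 12
steps <- as.matrix(expand.grid(rep(list(c(-1, 1)), n)))  # all 2^12 step sequences
X <- t(apply(steps, 1, cumsum))                          # rows are (X_1, ..., X_n)
M <- pmax(0, apply(X, 1, max))                           # X_0 = 0 counts towards the maximum
brute <- as.numeric(table(factor(M, levels = 0:n))) / nrow(steps)

rw_pmf <- function(n, k) {
  ok <- (abs(k) <= n) & ((n + k) %% 2 == 0)              # range and parity check
  out <- numeric(length(k))
  out[ok] <- choose(n, (n + k[ok]) / 2) * 2^(-n)
  out
}
formula <- rw_pmf(n, 0:n) + rw_pmf(n, 1:(n + 1))         # P[X_n = l] + P[X_n = l + 1]
max(abs(brute - formula))
```

The two vectors should agree, and `brute` should exhibit the same pairing of equal probabilities at \(2i-1\) and \(2i\) that we noticed in the histogram.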
Here is an example of a typical problem where the reflection principle (i.e., the formula for \({\mathbb{P}}[M_n=k]\)) is used:
Let \(X\) be a simple symmetric random walk. What is the probability that \(X_n\leq 0\) for all \(0\leq n \leq T\)?
This is really a question about the maximum, but in disguise. The walk will stay negative or \(0\) if and only if its running maximum \(M_T\) at time \(T\) takes the value \(0\). By our formula for \({\mathbb{P}}[M_n=l]\) we have \[ {\mathbb{P}}[M_T=0] = {\mathbb{P}}[X_T=0] + {\mathbb{P}}[X_T = 1].\] When \(T=2N\) this evaluates to \(\binom{2N}{N} 2^{-2N}\), and when \(T=2N-1\) to \(\binom{2N-1}{N} 2^{-(2N-1)}\).
What is the probability that a simple symmetric random walk will reach the level \(l=1\) in \(T\) steps or fewer? What happens when \(T\to\infty\)?
The first question is exactly the opposite of the question in our previous example, so the answer is \[ 1 - {\mathbb{P}}[M_T=0] = 1- {\mathbb{P}}[X_T=0] - {\mathbb{P}}[X_T=1].\] As above, this evaluates to \(1-\binom{2N}{N} 2^{-2N}\) when \(T=2N\) is even (we skip the case of odd \(T\) because it is very similar). When \(N\to\infty\), we expect \(\binom{2N}{N}\) to go to \(+\infty\) and \(2^{-2N}\) to go to \(0\), so it is not immediately clear which term will win. One way to make a guess is to think about it probabilistically: we are looking at the probability \({\mathbb{P}}[X_{2N}=0]\) that the random walk takes the value \(0\) after exactly \(2N\) steps. Even though no other (single) value is more likely to happen, there are so many other values \(X_{2N}\) could take (anything even from \(-2N\) to \(2N\) except for \(0\)) that we conjecture that its probability converges to \(0\). A formal mathematical argument which proves that our conjecture is, indeed, correct involves Stirling’s formula:
\[ N! \sim \sqrt{2 \pi N} \left( \frac{N}{e} \right)^N \text{ where } A_N \sim B_N \text{ means that } \lim_{N\to\infty} \frac{A_N}{B_N}=1. \]
We write \(\binom{2N}{N} = \tfrac{(2N)!}{N! N!}\) and apply Stirling’s formula to each factorial (let’s skip the details) to conclude that \[ \binom{2N}{N} 2^{-2N}\sim \frac{1}{\sqrt{N \pi}} \text{ so that } \lim_{N\to\infty} \binom{2N}{N} 2^{-2N} = 0. \]
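The asymptotics are easy to observe numerically. The sketch below (the values of `N` are an arbitrary choice) computes \(\binom{2N}{N} 2^{-2N}\) on the log scale with `lchoose`, which avoids overflow for large \(N\), and compares it with \(1/\sqrt{N\pi}\):

```r
N <- c(10, 100, 1000, 10000)
exact  <- exp(lchoose(2 * N, N) - 2 * N * log(2))  # binom(2N, N) * 2^(-2N)
approx <- 1 / sqrt(pi * N)                         # Stirling approximation
cbind(N, exact, approx, ratio = exact / approx)
```

The ratio in the last column should approach \(1\), while both the exact value and its approximation shrink to \(0\).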
The result of the previous problem implies the following important fact:
The simple symmetric random walk will reach the level \(1\), with certainty, given enough time.
Indeed, we just proved that the probability of this not happening during the first \(T\) steps shrinks down to \(0\) as \(T\to\infty\).
But wait, there is more! By symmetry, the level \(1\) can be replaced by \(-1\). Also, once we hit \(1\), the random walk “renews itself” (this property is called the Strong Markov Property and we will talk about it later), so it will eventually hit the level \(2\), as well. Continuing the same way, we get the following remarkable result:
Sooner or later, the simple symmetric random walk will visit any level.
We close this chapter with an application of the reflection principle to a classical problem in probability and combinatorics. Feel free to skip it if you want to.
Suppose that two candidates, Daisy and Oscar, are running for office, and \(T \in{\mathbb{N}}\) voters cast their ballots. Votes are counted the old-fashioned way, namely by the same official, one by one, until all \(T\) of them have been processed. After each ballot is opened, the official records the number of votes each candidate has received so far. At the end, the official announces that Daisy has won by a margin of \(k>0\) votes, i.e., that Daisy got \((T+k)/2\) votes and Oscar the remaining \((T-k)/2\) votes. What is the probability that at no time during the counting has Oscar been in the lead?
We assume that the order in which the official counts the votes is completely independent of the actual votes, and that each voter chooses Daisy with probability \(p\in (0,1)\) and Oscar with probability \(q=1-p\). We don’t know a priori what \(p\) is, and, as it turns out, we don’t need to!
For \(0 \leq n \leq T\), let \(X_n\) be the number of votes received by Daisy minus the number of votes received by Oscar in the first \(n\) ballots. When the \(n+1\)-st vote is counted, \(X_n\) either increases by \(1\) (if the vote was for Daisy), or decreases by 1 otherwise. The votes are independent of each other and \(X_0=0\), so \(X_n\), \(0\leq n \leq T\) is a simple random walk with the time horizon \(T\). The probability of an up-step is \(p\in (0,1)\), so this random walk is not necessarily symmetric. The ballot problem can now be restated as follows:
For a simple random walk \(\{X_n\}_{0\leq n \leq T}\), what is the probability that \(X_n\geq 0\) for all \(n\) with \(0\leq n \leq T\), given that \(X_T=k\)?
The first step towards understanding the solution is the realization that the exact value of \(p\) does not matter. Indeed, we are interested in the conditional probability \({\mathbb{P}}[ F|G]={\mathbb{P}}[F\cap G]/{\mathbb{P}}[G]\), where \(F\) denotes the set of \(\omega\) whose corresponding trajectories always stay non-negative, while the trajectories corresponding to \(\omega\in G\) reach \(k\) at time \(T\). Each \(\omega \in G\) consists of exactly \((T+k)/2\) up-steps (\(1\)s) and \((T-k)/2\) down steps (\(-1\)s), so its probability weight is equal to \(p^{ (T+k)/2} q^{(T-k)/2}\). Therefore, with \(\# A\) denoting the number of elements in the set \(A\), we get \[\begin{aligned} {\mathbb{P}}[ F|G]=\frac{{\mathbb{P}}[F\cap G]}{{\mathbb{P}}[G]}=\frac{\# (F\cap G) \ p^{ (T+k)/2} q^{(T-k)/2}}{ \# G \ p^{ (T+k)/2} q^{(T-k)/2}}=\frac{\#(F\cap G)}{\# G}.\end{aligned}\] This is quite amazing in and of itself. This conditional probability does not depend on \(p\) at all!
Since we already know how to count the number of elements in \(G\) (there are \(\binom{T}{(T+k)/2}\)), “all” that remains to be done is to count the number of elements in \(G\cap F\). The elements in \(G \cap F\) form a portion of all the elements in \(G\) whose trajectories don’t hit the level \(l=-1\); this way, \(\#(G\cap F)=\#G-\#H\), where \(H\) is the set of all paths which finish at \(k\), but cross (or, at least, touch) the level \(l=-1\) in the process. Can we use the reflection principle to find \(\# H\)? Yes, we can. In fact, you can convince yourself that the reflection of any trajectory corresponding to \(\omega \in H\) around the level \(l=-1\) after its last hitting time of that level produces a trajectory that starts at \(0\) and ends at \(-k-2\), and vice versa.
The number of paths from \(0\) to \(-k-2\) is easy to count - it is equal to \(\binom{T}{(T+k)/2+1}\). Putting everything together, we get \[{\mathbb{P}}[ F|G]=\frac{\binom{T}{n_1}-\binom{T}{n_1+1}} {\binom{T}{n_1}}=\frac{k+1}{n_1+1},\text{ where }n_1=\frac{T+k}{2}.\] The last equality follows from the definition of binomial coefficients \(\binom{T}{i}=\frac{T!}{i!(T-i)!}\).
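For small \(T\), the formula can be checked by brute force: list all \(\binom{T}{n_1}\) orders in which the votes can be counted and record the fraction in which Oscar is never in the lead. The sketch below does that; the function name `ballot_check` and its arguments `n_votes` (our \(T\)) and `k` are hypothetical names chosen for this illustration.

```r
# Brute-force check of the ballot formula (k + 1) / (n1 + 1), where n1 = (T + k) / 2.
ballot_check <- function(n_votes, k) {
  n1 <- (n_votes + k) / 2                 # number of votes for Daisy
  # Each column of `ups` lists the positions of Daisy's votes in one counting order.
  ups <- combn(n_votes, n1)
  stays_nonneg <- apply(ups, 2, function(pos) {
    steps <- rep(-1, n_votes)             # Oscar's votes are down-steps
    steps[pos] <- 1                       # Daisy's votes are up-steps
    all(cumsum(steps) >= 0)               # TRUE if Oscar is never in the lead
  })
  c(brute = mean(stays_nonneg), formula = (k + 1) / (n1 + 1))
}

ballot_check(n_votes = 10, k = 2)
```

Because all counting orders in \(G\) are equally likely, the fraction computed by enumeration is exactly \(\#(F\cap G)/\# G\), with no reference to \(p\).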
The Ballot problem has a long history (going back to at least 1887) and has spurred a lot of research in combinatorics and probability. In fact, people still write research papers on some of its generalizations. When posed outside the context of probability, it is often phrased as “in how many ways can the counting be performed …” (the difference being only in the normalizing factor \(\binom{T}{n_1}\) appearing in the example above). A special case \(k=0\) seems to be even more popular - the number of \(2n\)-step paths from \(0\) to \(0\) never going below zero is called the \(n\)-th Catalan number and equals \[\begin{align} C_n=\frac{1}{n+1} \binom{2n}{n}. \end{align}\]
Given \(n\in{\mathbb{N}}\), compute \({\mathbb{P}}[ \tau_1 = 2n+1 ]\) for a simple, but possibly biased, random walk. (Note: Clearly, \({\mathbb{P}}[ \tau_1=2n]=0\).)
Let \(A\) denote the set of all trajectories of length \(2n+1\) that hit \(1\) for the first time at time \(2n+1\), and let \(A'\) be the set of all trajectories of length \(2n\) which stay at or below \(0\) at all times and take the value \(0\) at time \(2n\). Clearly, each trajectory in \(A\) is a trajectory in \(A'\) with \(1\) attached at the very end, so that \(\# A = \# A'\).
By (the last part of) the previous problem and symmetry, \(\# A' = \frac{1}{n+1} \binom{2n}{n}\) (the \(n^{\text{th}}\) Catalan number). As above, all paths in \(A\) have the same probability weight, namely \(p^{n+1} q^n\), so \[ {\mathbb{P}}[ \tau_1 = 2n+1]= p^{n+1} q^n \frac{1}{n+1} \binom{2n}{n}.\]
Given \(p\in (0,1)\),
Using the previous problem, we need to sum the following series \[\sum_{k=0}^{\infty} {\mathbb{P}}[\tau_1=k] = \sum_{n=0}^{\infty} {\mathbb{P}}[ \tau_1 = 2n+1] = \sum_{n=0}^{\infty} p^{n+1} q^{n} \frac{1}{n+1} \binom{2n}{n} = p \sum_{n=0}^{\infty} (pq)^n \frac{1}{n+1} \binom{2n}{n}.\] The sum looks difficult, so let us plot a numerical approximation of its value for different values of the parameter \(p\) (the true value is plotted in orange):
We conjecture that \({\mathbb{P}}[ \tau_1 <\infty ] = 1\) for \(p\geq \tfrac{1}{2}\), but \({\mathbb{P}}[ \tau_1<\infty]<1\) for \(p<\tfrac{1}{2}\). Indeed, using methods beyond the scope of these notes, it can be shown that our conjecture is true and that \[ {\mathbb{P}}[ \tau_1<\infty ] =\begin{cases} 1, & p \geq \tfrac{1}{2}\\ \frac{p}{q}, & p<\tfrac{1}{2}. \end{cases} \]
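A numerical approximation of this kind can be produced with a few lines of R. The sketch below computes partial sums of the series for \({\mathbb{P}}[\tau_1<\infty]\) and places them next to the closed-form value \(\min(1, p/q)\). The function name `hit_prob_partial` and the cutoff `n_max = 5000` are arbitrary choices made for this illustration.

```r
# Partial sums of P[tau_1 < Inf] = sum_n p^(n+1) q^n choose(2n, n) / (n + 1).
hit_prob_partial <- function(p, n_max = 5000) {
  q <- 1 - p
  n <- 0:n_max
  # Logarithm of each term, to avoid overflow in choose(2n, n).
  log_terms <- (n + 1) * log(p) + n * log(q) + lchoose(2 * n, n) - log(n + 1)
  sum(exp(log_terms))
}

p_grid  <- c(0.2, 0.4, 0.5, 0.6, 0.8)
partial <- sapply(p_grid, hit_prob_partial)
closed  <- pmin(1, p_grid / (1 - p_grid))    # the closed form quoted in the text
print(cbind(p = p_grid, partial_sum = partial, closed_form = closed))
```

The agreement is essentially perfect away from \(p=\tfrac{1}{2}\). At \(p=\tfrac{1}{2}\) the terms decay only like \(n^{-3/2}\), so the partial sum approaches \(1\) slowly and stays slightly below it.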
Since \({\mathbb{P}}[ \tau_1= \infty]>0\) for \(p<\tfrac{1}{2}\), we can immediately conclude that \({\mathbb{E}}[\tau_1]=\infty\) in that case. Therefore, we assume that \(p\geq \tfrac{1}{2}\), and consider the sum \[ {\mathbb{E}}[\tau_1] = \sum_{k=0}^{\infty} k {\mathbb{P}}[\tau_1 = k] = \sum_{n=0}^{\infty} (2n+1) {\mathbb{P}}[ \tau_1 = 2n+1] = \sum_{n=0}^{\infty} p^{n+1} q^{n} \frac{2n+1}{n+1} \binom{2n}{n}.\] We have already seen that (by Stirling’s formula) we have \(\binom{2n}{n} \sim \frac{2^{2n}}{\sqrt{\pi n}}\), and \(\tfrac{2n+1}{n+1}\to 2\), so the question reduces to the one about convergence of the following, simpler, series: \[ \sum_{n=1}^{\infty} \frac{1}{\sqrt{n}} p^n q^{n} 2^{2n} = \sum_{n=1}^{\infty} \frac{1}{\sqrt{n}} (4pq)^n.\] When \(p=\tfrac{1}{2}\), we have \(4pq=1\), and the series above becomes \(\sum_{n=1}^{\infty} n^{-1/2}\), the so-called \(p\)-series with exponent \(\tfrac{1}{2}\) (this exponent has nothing to do with our probability \(p\)). Hence, it diverges. On the other hand, when \(p>\tfrac{1}{2}\), we have \(4pq<1\), and the terms of the series are dominated by the terms of the convergent geometric series \(\sum_{n=1}^{\infty} (4pq)^n\). Therefore, it, itself, must converge. All in all: \[ {\mathbb{E}}[\tau_1] = \begin{cases} \infty, & p\leq \tfrac{1}{2}, \\ <\infty, & p > \tfrac{1}{2}. \end{cases} \]
Let \(a_j = {\mathbb{E}}^{j}[\tau_1]\), where \({\mathbb{E}}^{j}\) means that the random walk starts from the level \(j\), i.e., \(X_0=j\), instead of the usual \(X_0=0\). Think about why it is plausible that the following relations hold for the sequence \(a_j\): \[a_1 = 0,\text{ and } a_j = 1 + p a_{j+1} + q a_{j-1} \text{ for } j<1.\] We guess that \(a_j\) has the form \(a_j = c(1-j)\), for \(j<1\) (why?) and plug that guess into the above equation to get: \[ c(1-j) = 1 + p c (-j) + q c (2-j) = 1 - c + 2 c q + c(1-j).\] It follows that \(c = \tfrac{1}{1-2q} = \tfrac{1}{p-q}\). Thus, if you believe the heuristic, we have \[ {\mathbb{E}}[ \tau_1 ] = \begin{cases} \frac{1}{p-q}, & p>\tfrac{1}{2}, \\ + \infty, & p\leq \tfrac{1}{2}. \end{cases}\] (Note: If you have never seen it before, the approach we took here seems very unusual. Indeed, in order to find the value of \(a_0\) we decided to compute values for the elements of the whole sequence \(a_j\). This kind of thinking will appear many times later in the chapters on Markov Chains.)
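The heuristic value \(\tfrac{1}{p-q}\) can be compared with a direct numerical summation of the series \(\sum_n (2n+1){\mathbb{P}}[\tau_1=2n+1]\) for a few values of \(p>\tfrac{1}{2}\). This is only a sanity check, not a proof; the function name `expected_tau1` and the cutoff `n_max` are our own choices.

```r
# Numerical value of E[tau_1] = sum_n (2n + 1) p^(n+1) q^n choose(2n, n) / (n + 1), for p > 1/2.
expected_tau1 <- function(p, n_max = 5000) {
  q <- 1 - p
  n <- 0:n_max
  # log of P[tau_1 = 2n + 1], computed on the log scale to avoid overflow
  log_prob <- (n + 1) * log(p) + n * log(q) + lchoose(2 * n, n) - log(n + 1)
  sum((2 * n + 1) * exp(log_prob))
}

p_grid <- c(0.6, 0.7, 0.9)
print(cbind(p = p_grid,
            series = sapply(p_grid, expected_tau1),
            heuristic = 1 / (2 * p_grid - 1)))   # 1 / (p - q) with q = 1 - p
```

For \(p>\tfrac{1}{2}\) the terms decay geometrically (like \((4pq)^n\)), so a few thousand terms are far more than enough.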
A random time is simply a random variable which takes values in the set \({\mathbb{N}}_0\) - it is random, and it can be interpreted as a point in time. Not all random times are created equal, though: here are three examples based on a simple symmetric random walk \(X\):
\(\tau = 3\). This is the simplest random time - it always takes the value \(3\), no matter what. It is random only in the formal sense of the word (just as the constant random variable \(X=3\) is a random variable, but not a very interesting one). Constant random times, like \(\tau=3\), are called deterministic times.
\(\tau=\tau_1\) where \(\tau_1\) is the first time \(X\) hits the level \(1\). It is no longer constant - it clearly depends on the underlying trajectory of the random walk: sometimes \(\tau_1=1\); other times it can be very large.
\(\tau=\tau_{\max}\) where \(\tau_{\max}\) is the first time \(X\) takes its maximal value in the interval \(\{0,1,\dots, 100\}\). The random time \(\tau_{\max}\) is clearly non-constant, but it differs from \(\tau=3\) or \(\tau=\tau_1\) in a significant way.
Indeed, the first two examples have the following property:
Given a time \(n\), you can tell whether \(\tau=n\) or not using only the information you have gathered by time \(n\).
The third one does not. Random times which do have this property are called stopping times. Here is a more precise, mathematical, definition. You should note that we allow our stopping times to take the value \(+\infty\). The usual interpretation is that whatever the stopping time is modeling never happens.
Definition. A random variable \(\tau\) taking values in \({\mathbb{N}}_0\cup\{+\infty\} = \{0,1,2,\dots, +\infty\}\) is said to be a stopping time with respect to the process \(\{X_n\}_{n\in {\mathbb{N}}_0}\) if for each \(n\in{\mathbb{N}}_0\) there exists a function \(G^n:{\mathbb{R}}^{n+1}\to \{0,1\}\) such that \[\mathbf{1}_{\{\tau=n\}}=G^n(X_0,X_1,\dots, X_n), \text{ for all } n\in{\mathbb{N}}_0.\]
The functions \(G^n\) are called the decision functions, and should be thought of as a black box which takes the values of the process \(\{X_n\}_{n\in {\mathbb{N}}_0}\) observed up to the present point and outputs either \(0\) or \(1\). The value \(0\) means keep going and \(1\) means stop. The whole point is that the decision has to be based only on the available observations and not on the future ones.
Alternatively, you can think of a stopping time as an R function whose input is a vector which represents a trajectory \(\omega\) of a random walk (or any other process) and the output is a nonnegative integer. This function needs to be such that if it “decides” to output the value \(k\), it had to have based its decision only on the first \(k\) components of \(\omega\). This means that if the output corresponding to the input trajectory \(\omega\) is \(k\), and \(\omega'\) is another trajectory whose first \(k\) components match those of \(\omega\), then the output corresponding to \(\omega'\) must also be \(k\).
Now that we know how to spot stopping times, let’s list some examples:
The simplest examples of stopping times are (non-random) deterministic times. Just set \(\tau=5\) (or \(\tau=723\) or \(\tau=n_0\) for any \(n_0\in{\mathbb{N}}_0\cup\{+\infty\}\)), no matter what the state of the world \(\omega\in\Omega\) is. The family of decision rules is easy to construct: \[G^n(x_0,x_1,\dots, x_n)=\begin{cases} 1,& n=n_0, \\ 0, & n\not= n_0.\end{cases}\] Decision functions \(G^n\) do not depend on the values of \(X_0,X_1,\dots, X_n\) at all. A gambler who stops gambling after 20 games, no matter what the winnings or losses are, uses such a rule.
Probably the most well-known examples of stopping times are (first) hitting times. They can be defined for general stochastic processes, but we will stick to simple random walks for the purposes of this example. So, let \(X_n=\sum_{k=1}^n \delta_k\) be a simple random walk, and let \(\tau_l\) be the first time \(X\) hits the level \(l\in{\mathbb{N}}\). More precisely, we use the following slightly non-intuitive but mathematically correct definition \[\tau_l=\min \{ n\in{\mathbb{N}}_0\, : \, X_n=l\}.\] The set \( \{ n\in{\mathbb{N}}_0\, : \, X_n=l\}\) is the collection of all time-points at which \(X\) visits the level \(l\). The earliest one - the minimum of that set - is the first hitting time of \(l\). In states of the world \(\omega\in\Omega\) in which the level \(l\) just never gets reached, i.e., when \( \{ n\in{\mathbb{N}}_0\, : \, X_n=l\}\) is an empty set, we set \(\tau_l(\omega)=+\infty\).
In order to show that \(\tau_l\) is indeed a stopping time, we need to construct the decision functions \(G^n\), \(n\in{\mathbb{N}}_0\). Let us start with \(n=0\). We would have \(\tau_l=0\) only in the (impossible) case \(X_0=l\), so we always have \(G^0(X_0)=0\). How about \(n\in{\mathbb{N}}\)? For the value of \(\tau_l\) to be equal to exactly \(n\), two things must happen:
\(X_n=l\) (the level \(l\) must actually be hit at time \(n\)), and
\(X_{n-1}\not = l\), \(X_{n-2}\not= l\), …, \(X_{1}\not=l\), \(X_0\not=l\) (the level \(l\) has not been hit before).
Therefore, \[G^n(x_0,x_1,\dots, x_n)=\begin{cases} 1,& x_0\not=l, x_1\not= l, \dots, x_{n-1}\not=l, x_n=l\\ 0,&\text{otherwise}. \end{cases}\] The hitting time \(\tau_2\) of the level \(l=2\) for a particular trajectory of a symmetric simple random walk is depicted below:
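In the spirit of thinking about stopping times as R functions, here is a minimal sketch of the first hitting time. The function name `hitting_time` is a hypothetical choice, and the two trajectories are made up for illustration. The loop scans the trajectory from left to right, so by the time it returns the value \(n\) it has looked only at \(x_0,\dots,x_n\), just like the decision function \(G^n\).

```r
# First hitting time of level l. `x` holds (X_0, X_1, ..., X_T);
# the function returns Inf if the level is never reached.
hitting_time <- function(x, l) {
  for (i in seq_along(x)) {
    # At this point only x[1], ..., x[i], i.e., X_0, ..., X_{i-1}, have been inspected.
    if (x[i] == l) return(i - 1)     # R vectors are 1-based, time is 0-based
  }
  Inf
}

# Two trajectories that agree up to time 4 and differ afterwards.
omega1 <- c(0, 1, 0, 1, 2, 3, 2)
omega2 <- c(0, 1, 0, 1, 2, 1, 0)
c(hitting_time(omega1, 2), hitting_time(omega2, 2))   # the same value for both
```

Both calls return the same hitting time because the two trajectories share all the components the function inspected before stopping.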
How about something that is not a stopping time? Let \(T\in{\mathbb{N}}\) be an arbitrary time-horizon and let \(\tau_{\max}\) be the last time during \(0,\dots, T\) that the random walk visits its maximum during \(0,\dots, T\):
If you bought a share of a stock at time \(n=0\), had to sell it some time before or at \(T\) and had the ability to predict the future, this is one of the points you would choose to sell it at. Of course, it is impossible in general to decide whether \(\tau_{\max}=n\), for some \(n\in\{0,\dots, T-1\}\), without the knowledge of the values of the random walk after \(n\).
More precisely, let us sketch the proof of the fact that \(\tau_{\max}\) is not a stopping time. Suppose, to the contrary, that it is, and let \(G^n\) be the associated family of decision functions. Consider the following two trajectories: \((0,1,2,3,\dots, T-1,T)\) and \((0,1,2,3,\dots, T-1,T-2)\). They differ only in the direction of the last step. They also differ in the fact that \(\tau_{\max}=T\) for the first one and \(\tau_{\max}=T-1\) for the second one. On the other hand, by the definition of the decision functions, we have \[\mathbf{1}_{\{\tau_{\max}=T-1\}}=G^{T-1}(X_0,\dots, X_{T-1}).\] The right-hand side is equal for both trajectories, while the left-hand side equals \(0\) for the first one and \(1\) for the second one. A contradiction.
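The two trajectories from the argument above are easy to play with in R. In the sketch below, `tau_max` and `T_hor` (our \(T\), set to \(5\) for concreteness) are hypothetical names chosen for this illustration.

```r
# Last time in 0..T at which the trajectory attains its maximum.
tau_max <- function(x) max(which(x == max(x))) - 1   # shift to 0-based time

T_hor     <- 5
up_all    <- 0:T_hor                        # (0, 1, ..., T-1, T)
last_down <- c(0:(T_hor - 1), T_hor - 2)    # (0, 1, ..., T-1, T-2)

# The two trajectories agree at times 0, ..., T-1 ...
all(up_all[1:T_hor] == last_down[1:T_hor])
# ... but tau_max differs, so no function of (X_0, ..., X_{T-1}) can decide whether tau_max = T-1.
c(tau_max(up_all), tau_max(last_down))
```

Unlike a hitting time, `tau_max` has to see the whole vector before it can return anything, which is precisely what disqualifies it as a stopping time.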
One of the superpowers of stopping times is that they often behave just like deterministic times. The best way to understand this statement is in the context of the beautiful martingale theory. Unfortunately, learning about martingales would take an entire semester, so we have to settle for an illustrative example, namely, Wald’s identity.
Let \(\{\xi_n\}_{n\in{\mathbb{N}}}\) be a sequence of independent and identically distributed random variables. The example you should keep in mind is \(\xi_n = \delta_n\), where \(\delta_n\) are coin tosses in the definition of a random walk. We set \(X_n = \sum_{k=1}^n \xi_k\) and note that it is easy to compute \({\mathbb{E}}[X_n]\): \[ {\mathbb{E}}[ X_n ] = {\mathbb{E}}[ \xi_1+\dots + \xi_n] = {\mathbb{E}}[\xi_1] + \dots + {\mathbb{E}}[\xi_n] = n \mu, \text{ where } \mu = {\mathbb{E}}[\xi_1]={\mathbb{E}}[\xi_2]=\dots\] provided \({\mathbb{E}}[\xi_1]\) exists. The expected value \(\mu\) is the same for all \(\xi_1,\xi_2,\dots\) because they all have the same distribution. In words, the equality above tells us that the expected value of \(X\) moves with speed \(\mu\). Wald’s identity tells us that the same thing is true when the deterministic time \(n\) is replaced by a stopping time. To understand its statement below, we must first introduce a bit more notation. Let \(\{X_n\}_{n\in {\mathbb{N}}_0}\) be a stochastic process, and let \(\tau\) be a random time which never takes the value \(+\infty\). Remember that \(X_0, X_1, \dots\) are random variables, i.e., functions of the elementary outcome \(\omega\in\Omega\). The same is true for \(\tau\). Therefore, in order to define the random variable \(X_{\tau}\) we need to specify what its value is for any given \(\omega\): \[ X_{\tau} (\omega) = X_{n}(\omega) \text{ where } n=\tau(\omega).\] This is exactly what you would expect; the elementary outcome \(\omega\) not only tells us which trajectory of the process to consider, but also the time at which to do it. Note that when \(\tau=n\) is a deterministic time, \(X_{\tau}\) is exactly \(X_n\).
Theorem. (Wald’s identity) Let \(\{\xi_n\}_{n\in{\mathbb{N}}}\) be a sequence of independent and identically distributed random variables, and let \(X_n = \sum_{k=1}^n \xi_k\) be the associated random walk. If \({\mathbb{E}}[ |\xi_n|]<\infty\) and \(\tau\) is a stopping time for \(\{X_n\}_{n\in {\mathbb{N}}_0}\) such that \({\mathbb{E}}[\tau]<\infty\), then \[ {\mathbb{E}}[X_{\tau}] = {\mathbb{E}}[\tau] \mu \text{ where } \mu = {\mathbb{E}}[\xi_1] = {\mathbb{E}}[\xi_2] = \dots \]
Before we prove this theorem, here is a handy identity:
(The “tail formula” for the expectation) Let \(\tau\) be an \({\mathbb{N}}_0\)-valued random variable. Show that \[{\mathbb{E}}[\tau]=\sum_{k=1}^{\infty} {\mathbb{P}}[\tau \geq k].\]
Clearly, \({\mathbb{P}}[\tau\geq k] = {\mathbb{P}}[ \tau=k] + {\mathbb{P}}[\tau=k+1]+\dots\). Therefore,
\[ \begin{array}{cccccccc} \sum_{k=1}^{\infty} {\mathbb{P}}[\tau \geq k] &=& {\mathbb{P}}[ \tau=1] &+& {\mathbb{P}}[\tau=2] &+& {\mathbb{P}}[\tau=3] &+& \dots \\ && &+& {\mathbb{P}}[\tau=2] &+& {\mathbb{P}}[\tau=3] &+& \dots \\ && && &+& {\mathbb{P}}[\tau=3] &+& \dots \\ && && && &+& \dots \end{array} \] If you look at the “columns”, you will realize that the expression \({\mathbb{P}}[\tau=1]\) appears in this sum once, \({\mathbb{P}}[\tau=2]\) twice, \({\mathbb{P}}[\tau=3]\) three times, etc. Hence \[\sum_{k=1}^{\infty} {\mathbb{P}}[ \tau\geq k] = \sum_{n=1}^{\infty} n {\mathbb{P}}[\tau=n] = {\mathbb{E}}[\tau].\]
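A quick numerical illustration of the tail formula, on a made-up distribution (the values in `probs` are arbitrary and serve only as an example):

```r
# tau takes the values 0, 1, ..., 4 with the (arbitrary) probabilities below.
vals  <- 0:4
probs <- c(0.10, 0.20, 0.30, 0.25, 0.15)

# Expectation computed directly from the definition.
mean_direct <- sum(vals * probs)

# Expectation computed as the sum of the tail probabilities P[tau >= k], k = 1, ..., 4.
tail_probs <- sapply(1:4, function(k) sum(probs[vals >= k]))
mean_tail  <- sum(tail_probs)

c(direct = mean_direct, tail_sum = mean_tail)
```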
Prove Wald’s identity.
Here is another representation of the random variable \(X_{\tau}\): \[X_{\tau} = \sum_{k=1}^{\tau} \xi_k=\sum_{k=1}^{\infty} \xi_k \mathbf{1}_{\{k\leq \tau\}}.\] The idea behind it is simple: add all the values of \(\xi_k\) for \(k\leq \tau\) and keep adding zeros (since \(\xi_k \mathbf{1}_{\{k\leq \tau\}}=0\) for \(k>\tau\)) after that. Taking expectation of both sides and switching \({\mathbb{E}}\) and \(\sum\) (this can be justified, but the argument is technical and we omit it here) yields: \[ {\mathbb{E}}[\sum_{k=1}^{\tau} \xi_k]=\sum_{k=1}^{\infty} {\mathbb{E}}[ \mathbf{1}_{\{k\leq \tau\}}\xi_k]. \] Let us examine the term \({\mathbb{E}}[\xi_k\mathbf{1}_{\{k\leq \tau\}}]\) in some detail. We first note that \[\mathbf{1}_{\{k\leq \tau\}}=1-\mathbf{1}_{\{k>\tau\}}=1-\mathbf{1}_{\{k-1\geq \tau\}}=1-\sum_{j=0}^{k-1}\mathbf{1}_{\{\tau=j\}},\] so that \[ {\mathbb{E}}[\xi_k \mathbf{1}_{\{k\leq \tau\}}]={\mathbb{E}}[\xi_k]-\sum_{j=0}^{k-1}{\mathbb{E}}[ \xi_k \mathbf{1}_{\{\tau=j\}} ].\] By the assumption that \(\tau\) is a stopping time, the indicator \(\mathbf{1}_{\{\tau=j\}}\) can be represented as \(\mathbf{1}_{\{\tau=j\}}=G^j(X_0,\dots, X_j)\), and, because each \(X_i\) is just a sum of the increments \(\xi_1, \dots, \xi_i\), we can actually write \(\mathbf{1}_{\{\tau=j\}}\) as a function of \(\xi_1,\dots, \xi_j\) only: \(\mathbf{1}_{\{\tau=j\}}=H^j(\xi_1,\dots, \xi_j).\) By the independence of \((\xi_1,\dots, \xi_j)\) from \(\xi_k\) (because \(j<k\)) we have \[\begin{align} {\mathbb{E}}[\xi_k \mathbf{1}_{\{\tau=j\}}]&={\mathbb{E}}[ \xi_k H^j(\xi_1,\dots, \xi_j)]= {\mathbb{E}}[\xi_k] {\mathbb{E}}[ H^j(\xi_1,\dots, \xi_j)]={\mathbb{E}}[\xi_k] {\mathbb{E}}[\mathbf{1}_{\{\tau=j\}}]= {\mathbb{E}}[\xi_k]{\mathbb{P}}[\tau=j].
\end{align}\] Therefore, \[\begin{align} {\mathbb{E}}[\xi_k \mathbf{1}_{\{k\leq \tau\}}]&={\mathbb{E}}[\xi_k]-\sum_{j=0}^{k-1} {\mathbb{E}}[\xi_k] {\mathbb{P}}[\tau=j]={\mathbb{E}}[\xi_k] {\mathbb{P}}[\tau\geq k] =\mu {\mathbb{P}}[\tau\geq k], \end{align}\] where the last equality follows from the fact that all \(\xi_k\) have the same expectation, namely \(\mu\).
Putting it all together, we get \[\begin{align} {\mathbb{E}}[X_{\tau}]&={\mathbb{E}}[\sum_{k=1}^{\tau} \xi_k]=\sum_{k=1}^{\infty} \mu {\mathbb{P}}[\tau\geq k]=\mu \sum_{k=1}^{\infty} {\mathbb{P}}[\tau\geq k]= {\mathbb{E}}[\tau] \mu, \end{align}\] where we use the “tail formula” to get the last equality.
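Wald’s identity can also be verified exactly, by enumeration, on a small example. In the sketch below we take a biased simple random walk with \(p=0.6\) and the stopping time \(\tau\) equal to the first time the walk hits \(-2\) or \(3\), or \(m=10\) if neither happens by then. All of these choices, as well as the names `steps`, `weight` and `stop_one`, are assumptions made only for this illustration.

```r
p <- 0.6; q <- 1 - p; m <- 10

# All 2^m sequences of increments (one per row), and the probability of each.
steps  <- unname(as.matrix(expand.grid(rep(list(c(-1, 1)), m))))
weight <- p^rowSums(steps == 1) * q^rowSums(steps == -1)

# For one sequence of increments, compute tau and X_tau.
stop_one <- function(xi) {
  x   <- cumsum(xi)                          # X_1, ..., X_m
  hit <- which(x == -2 | x == 3)             # times at which a barrier is visited
  tau <- if (length(hit) > 0) hit[1] else m  # stop at the first such time, or at m
  c(tau = tau, x_tau = x[tau])
}
res <- t(apply(steps, 1, stop_one))

lhs <- sum(weight * res[, "x_tau"])          # E[X_tau]
rhs <- (p - q) * sum(weight * res[, "tau"])  # mu * E[tau], with mu = p - q
c(E_X_tau = lhs, mu_E_tau = rhs)
```

The two numbers agree up to rounding error, as the theorem predicts: \(\tau\) is a bounded stopping time, so \({\mathbb{E}}[\tau]<\infty\) holds automatically.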
Show, by giving an example, that Wald’s identity does not necessarily hold if \(\tau\) is not a stopping time.
Let \(X\) be a simple symmetric random walk, and let \(\tau\) be a random time constructed like this: \[\begin{align} \tau = \begin{cases} 1, & X_1=1 \\ 0,& X_1=-1. \end{cases} \end{align}\] Then, \[\begin{align} X_{\tau} = \begin{cases} X_1, & X_1=1 \\ X_0, & X_1=-1, \end{cases} = \begin{cases} 1, & X_1=1 \\ 0,& X_1=-1. \end{cases} \end{align}\] and, therefore, \({\mathbb{E}}[ X_{\tau}] = 1 \cdot 1/2 + 0 \cdot 1/2 = 1/2\). On the other hand \(\mu={\mathbb{E}}[\xi_1]=0\) and \({\mathbb{E}}[\tau] = 1/2\), so \(1/2 = {\mathbb{E}}[X_{\tau}] \ne {\mathbb{E}}[\tau] \mu = 0\).
It is clear that \(\tau\) cannot be a stopping time, since Wald’s identity would hold for it if it were. To see directly that it is not a stopping time, consider the event \(\{\tau=0\}\). Its occurrence depends on whether \(X_1=1\) or not, which is not known at time \(0\).
A famous use of Wald’s identity is in the solution of the following classical problem:
A gambler starts with \(\$x\) dollars and repeatedly plays a game in which she wins a dollar with probability \(\tfrac{1}{2}\) and loses a dollar with probability \(\tfrac{1}{2}\). She decides to stop when one of the following two things happens:
she goes bankrupt, i.e., her wealth hits \(0\), or
she makes enough money, i.e., her wealth reaches some predetermined level \(a>x\).
The “Gambler’s ruin” problem (dating back at least to the 1600s) asks the following question: what is the probability that the gambler will make \(a\) dollars before she goes bankrupt?
Let the gambler’s “wealth” \(\{W_n\}_{n\in {\mathbb{N}}_0}\) be modeled by a simple random walk starting from \(x\), whose increments \(\xi_k=\delta_k\) are coin-tosses. Then \(W_n=x+X_n\), where \(X_n = \sum_{k=1}^n \xi_k\) is a SSRW. Let \(\tau\) be the time the gambler stops. We can represent \(\tau\) in two different (but equivalent) ways. On the one hand, we can think of \(\tau\) as the smaller of the two hitting times \(\tau_{-x}\) and \(\tau_{a-x}\) of the levels \(-x\) and \(a-x\) for the random walk \(\{X_n\}_{n\in {\mathbb{N}}_0}\) (remember that \(W_n=x+X_n\), so these two correspond to the hitting times for the process \(\{W_n\}_{n\in {\mathbb{N}}_0}\) of the levels \(0\) and \(a\)). On the other hand, we can think of \(\tau\) as the first hitting time of the two-element set \(\{-x,a-x\}\) for the process \(\{X_n\}_{n\in {\mathbb{N}}_0}\). In either case, it is quite clear that \(\tau\) is a stopping time (can you write down the decision functions?).
When we talked about the maximum of the simple symmetric random walk, we proved that it hits any value if given enough time. Therefore, the probability that the gambler’s wealth will remain strictly between \(0\) and \(a\) forever is zero. So, \({\mathbb{P}}[\tau<\infty]=1\).
What can we say about the random variable \(X_{\tau}\) - the gambler’s wealth (minus \(x\)) at the random time \(\tau\)? Clearly, it is either equal to \(-x\) or to \(a-x\), and the probabilities \(p_0\) and \(p_a\) with which it takes these values are exactly what we are after in this problem. We know that, since there are no other values \(X_{\tau}\) can take, we must have \(p_0+p_a=1\). Wald’s identity gives us another equation for \(p_0\) and \(p_a\): \[{\mathbb{E}}[X_{\tau}]={\mathbb{E}}[\xi_1] {\mathbb{E}}[\tau]=0\cdot {\mathbb{E}}[\tau]=0 \text{ so that } 0 = {\mathbb{E}}[X_{\tau}]=p_0 (-x)+p_a (a-x).\]
We now have a system of two linear equations with two unknowns, and solving it yields \[p_0= \frac{a-x}{a}, \ p_a=\frac{x}{a}.\] It is remarkable that the two probabilities are proportional to the amounts of money the gambler needs to make (lose) in the two outcomes. The situation is different when \(p\not=\tfrac{1}{2}\).
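The formula \(p_a = x/a\) is easy to test by simulation. The sketch below plays the game many times and records the proportion of games in which the gambler reaches \(a\) before going bankrupt. The function name `simulate_gambler`, the seed, the number of repetitions, and the values \(x=3\), \(a=10\) are arbitrary choices made for this illustration.

```r
# Play one game: start from wealth x, stop at 0 or a; return TRUE if a is reached first.
simulate_gambler <- function(x, a) {
  w <- x
  while (w > 0 && w < a) {
    w <- w + sample(c(-1, 1), 1)   # fair coin: win or lose a dollar
  }
  w == a
}

set.seed(2024)
x <- 3; a <- 10
wins <- replicate(10000, simulate_gambler(x, a))
c(simulated = mean(wins), formula = x / a)
```

With \(10{,}000\) repetitions, the simulated proportion should typically land within about \(0.01\) of \(x/a = 0.3\).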
In order to be able to use Wald’s identity, we need to check its conditions. We have already seen that \(\tau\) needs to be a stopping time, and not just any old random time. There are also two conditions about the expected values of \(\tau\) and of \(\xi_1\). If you read the above solution carefully, you will realize that we never checked whether \({\mathbb{E}}[\tau]<\infty\). We should have, but we did not because we still don’t have the mathematical tools to do it. We will see later that, indeed, \({\mathbb{E}}[\tau]<\infty\) for this particular stopping time. In general, the condition that \({\mathbb{E}}[\tau]<\infty\) is important, as the following simple example shows:
Let \(\{X_n\}_{n\in {\mathbb{N}}_0}\) be a simple symmetric random walk, and let \(\tau_1\) be the first hitting time of the level \(1\). Use Wald’s identity to show that \({\mathbb{E}}[\tau_1]=+\infty\).
Suppose, to the contrary, that \({\mathbb{E}}[\tau_1]<\infty\). Since \({\mathbb{E}}[|\delta_1|]<\infty\) and \(\tau_1\) is a stopping time, Wald’s identity applies: \[ {\mathbb{E}}[X_{\tau_1}] = {\mathbb{E}}[ \delta_1] \cdot {\mathbb{E}}[\tau_1].\] The right hand side is then equal to \(0\) because \({\mathbb{E}}[\delta_1]=0\). On the other hand, \(X_{\tau_1}=1\): the value of \(X_n\) when it first hits the level \(1\) is, of course, \(1\). This leads to a contradiction \(1={\mathbb{E}}[X_{\tau_1}] = {\mathbb{E}}[\delta_1] {\mathbb{E}}[\tau_1] = 0\). Therefore, our initial assumption that \({\mathbb{E}}[\tau_1]<\infty\) was wrong!
We close this chapter with another identity of Abraham Wald, namely “Wald’s second identity”. The original identity helped us compute the expected value of the position of a random walk at a stopping time. The second one computes the variance:
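Theorem. (Wald’s second identity) Let \(\{\xi_n\}_{n\in{\mathbb{N}}}\) be a sequence of independent and identically distributed random variables, and let \(X_n = \sum_{k=1}^n \xi_k\) be the associated random walk. If \({\mathbb{E}}[ (\xi_n)^2]<\infty\) and \(\tau\) is a stopping time for \(\{X_n\}_{n\in {\mathbb{N}}_0}\) such that \({\mathbb{E}}[\tau]<\infty\), then \[ {\mathbb{E}}[(X_{\tau}-\mu \tau)^2]= {\mathbb{E}}[\tau] \sigma^2 \text{ where } \mu={\mathbb{E}}[\xi_1] \text{ and } \sigma^2 = \operatorname{Var}[\xi_1] = \operatorname{Var}[\xi_2] = \dots \] In particular, when \(\mu=0\) (as is the case for the simple symmetric random walk), we have \(\operatorname{Var}[X_{\tau}]= {\mathbb{E}}[\tau] \sigma^2\). The assumption \(\mu=0\) cannot be dropped from this last statement: for a biased walk with \(p>\tfrac{1}{2}\) and \(\tau=\tau_1\), we have \(X_{\tau_1}=1\), so \(\operatorname{Var}[X_{\tau_1}]=0\), while \({\mathbb{E}}[\tau_1]\sigma^2>0\).
The proof is similar to (but more difficult than) the proof of Wald’s (first) identity, so we skip it.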
Let \(\{X_n\}_{n\in {\mathbb N}_0}\) be a simple symmetric random walk.
Compute \({\mathbb{P}}[ X_1 + X_2 + X_3 > 0]\).
Compute \({\mathbb{P}}[ X_3 = 1, X_9 = 1 \text{ and } X_{15} = 3]\).
Compute \({\mathbb{P}}[ X_n\geq -2 \text{ for all $n\leq 10$}]\).
Find the distribution (table) of the product \(X_1 X_3\).
There are \(8\) possible trajectories \((0,x_1,x_2,x_3)\) of length \(3\) that a random walk can take. Out of the \(8\), only the following three have \(x_1+x_2+x_3>0\): \[\begin{align} (0,1,2,3), (0,1,2,1) \text{ and } (0,1,0,1) \end{align}\] Since the walk is symmetric, each of those has probability \(1/8\), so \({\mathbb{P}}[ X_1+X_2+X_3>0] = 3/8\).
Since the increments of the random walk are independent, we have \[\begin{align} {\mathbb{P}}[ X_3 = 1, X_9=1 \text{ and } X_{15} = 3] & = {\mathbb{P}}[ X_3 = 1, \, X_9-X_3 = 0,\, X_{15} - X_9 = 2]\\ &= {\mathbb{P}}[ X_3 = 1]\times {\mathbb{P}}[ X_9 - X_3 =0 ] \times {\mathbb{P}}[ X_{15} - X_9 = 2] \end{align}\] Moreover, we have \({\mathbb{P}}[X_9 - X_3=0] = {\mathbb{P}}[ X_6 - X_0=0] = {\mathbb{P}}[ X_6 = 0]\). Similarly \({\mathbb{P}}[ X_{15} - X_9 = 2] = {\mathbb{P}}[ X_6 = 2]\). It remains to use the formula for probabilities of the form \({\mathbb{P}}[ X_n = k]\) from the notes to obtain
\[\begin{align} {\mathbb{P}}[ X_3 = 1, X_9=1 \text{ and } X_{15} = 3] & = \binom{3}{2} 2^{-3} \times \binom{6}{3} 2^{-6} \times \binom{6}{4} 2^{-6}= 2^{-15} \binom{3}{2} \binom{6}{3} \binom{6}{4}. \end{align}\]
By symmetry, this is the same as \({\mathbb{P}}[ X_n \leq 2, 0\leq n \leq 10] = {\mathbb{P}}[ M_{10} \leq 2]\), where \(M\) is the running-maximum process. By (the formula derived from) the reflection principle, we have \[\begin{align} {\mathbb{P}}[ M_{10} \leq 2] &= {\mathbb{P}}[ M_{10} = 2]+{\mathbb{P}}[M_{10} = 1] + {\mathbb{P}}[ M_{10} = 0] =({\mathbb{P}}[ X_{10} = 2] + {\mathbb{P}}[X_{10}=3]) +({\mathbb{P}}[X_{10} = 1] + {\mathbb{P}}[X_{10} = 2]) + ({\mathbb{P}}[ X_{10}=0] + {\mathbb{P}}[ X_{10}=1]) \\ &= {\mathbb{P}}[ X_{10} = 0] + 2 {\mathbb{P}}[ X_{10} = 2] = \binom{10}{5} 2^{-10} + 2 \times \binom{10}{6} 2^{-10}. \end{align}\]
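With only \(2^{10}=1024\) trajectories, this answer can be double-checked by listing all of them. The sketch below does that; the names `steps` and `stays_above` are our own.

```r
# All 2^10 sequences of increments of a simple symmetric random walk, one per row.
steps <- unname(as.matrix(expand.grid(rep(list(c(-1, 1)), 10))))

# For each trajectory, check whether X_n >= -2 for all n <= 10.
stays_above <- apply(steps, 1, function(xi) all(cumsum(xi) >= -2))

brute   <- mean(stays_above)                             # all trajectories are equally likely
formula <- (choose(10, 5) + 2 * choose(10, 6)) / 2^10    # reflection-principle answer
c(brute = brute, formula = formula)
```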
There are several ways of solving this problem. The simplest one would be to list all \(8\) trajectories of the random walk of length \(3\) and compute the value of \(X_1 \times X_3\) on each of them:
| trajectory | value |
|---|---|
| (0,1,2,3) | 3 |
| (0,1,2,1) | 1 |
| (0,1,0,1) | 1 |
| (0,1,0,-1) | -1 |
| (0,-1,0,1) | -1 |
| (0,-1,0,-1) | 1 |
| (0,-1,-2,-1) | 1 |
| (0,-1,-2,-3) | 3 |
Since each trajectory has probability \(1/8\), counting the number of times each of the possible values \(-1\), \(1\) or \(3\) appears gives us the distribution of \(X_1 X_3\):
| -1 | 1 | 3 |
|---|---|---|
| 0.25 | 0.5 | 0.25 |
Alternatively, we can write \(X_1\) and \(X_3\) as a sum of independent coin tosses and obtain \[\begin{align} X_1 X_3 = X_1 (X_1+ X_3 - X_1) = X_1^2 + X_1(X_3 - X_1) = 1 + X_1 (X_3 - X_1). \end{align}\] The random variable \(X_3 - X_1\) has the same distribution as \(X_2\). It is, also, independent of \(X_1\), so multiplying it by \(X_1\) is equivalent to switching its sign with probability \(1/2\), independently of its value. But \(X_2\) is symmetric so this independent sign switch does not affect its distribution. Hence \(X_1(X_3-X_1)\) has the same distribution as \(X_2\), and since \(X_2\) takes values \(-2,0\) and \(2\) with probabilities \(0.25, 0.5\) and \(0.25\), we arrive quickly at the distribution table above.
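The same distribution table can be obtained in R by enumerating the \(8\) equally likely increment sequences. This is a minimal sketch; the variable names are our own.

```r
# All 2^3 sequences of increments (delta_1, delta_2, delta_3), one per row.
steps <- unname(as.matrix(expand.grid(rep(list(c(-1, 1)), 3))))

x1 <- steps[, 1]          # X_1 = delta_1
x3 <- rowSums(steps)      # X_3 = delta_1 + delta_2 + delta_3

# Each row has probability 1/8, so relative frequencies are probabilities.
dist <- table(x1 * x3) / nrow(steps)
print(dist)
```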
Let \(\{X_n\}_{0\leq n \leq 10}\) be a simple symmetric random walk with time horizon \(T=10\). What is the probability it will never reach the level \(5\)?
A fair coin is tossed repeatedly, with the first toss resulting in \(H\) (i.e., heads). After that, each time the outcome of the coin matches the previous outcome, the player gets a dollar. If the two do not match, the player has to pay a dollar. The player stops playing once she “earns” \(10\) dollars. What is the probability that she will need at least 20 tosses (including the first one) to achieve that?
A fair coin is tossed repeatedly and the record of the outcomes is kept. Tossing stops the moment the total number of heads obtained so far exceeds the total number of tails by 3. For example, a possible sequence of tosses could look like HHTTTHHTHHTHH. What is the probability that the length of such a sequence is at most 10?
\[\begin{aligned} {\mathbb{P}}[ M_{10}\leq 4 ] &= {\mathbb{P}}[ M_{10}=0] + {\mathbb{P}}[ M_{10}=1] + {\mathbb{P}}[ M_{10} = 2] + {\mathbb{P}}[M_{10} = 3] + {\mathbb{P}}[ M_{10} = 4] \\ & = ({\mathbb{P}}[ X_{10} = 0] + {\mathbb{P}}[ X_{10} = 1] ) + ({\mathbb{P}}[ X_{10} = 1] + {\mathbb{P}}[ X_{10} = 2] ) \\ & + ({\mathbb{P}}[ X_{10} = 2] + {\mathbb{P}}[ X_{10} = 3] ) + ({\mathbb{P}}[ X_{10} = 3] + {\mathbb{P}}[ X_{10} = 4] )\\ & + ({\mathbb{P}}[ X_{10} = 4] + {\mathbb{P}}[ X_{10} = 5] ) \\ &= 2 ({\mathbb{P}}[ X_{10}=4] + {\mathbb{P}}[X_{10} = 2]) + {\mathbb{P}}[X_{10} =0] \\&= 2^{-10}( 2 \binom{10}{7} + 2 \binom{10}{6} + \binom{10}{5}) \end{aligned}\]
Let the outcomes of the coin tosses be denoted by \(\gamma_1 = H\), \(\gamma_2, \gamma_3, \dots\). We define the random variables \(\delta_1,\delta_2,\dots\) as follows: \(\delta_1 = 1\) if \(\gamma_2 = \gamma_1 = H\) and \(\delta_1 = -1\), otherwise. Similarly, \(\delta_2 = 1\) if \(\gamma_3 = \gamma_2\) and \(-1\) otherwise, and so on. It is clear that \(\delta_1,\delta_2,\dots\) is an iid sequence of coin tosses, just like in the definition of a simple symmetric random walk. After \(n\) tosses (not counting the first one), our gambler has \(X_n = \delta_1+\delta_2 + \dots + \delta_n\) dollars. She will need at least 19 tosses (excluding the first one) to reach \(10\) dollars if and only if the value of the running maximum process at time \(n=18\) is at most \(9\). Using the formula from the formula sheet, this evaluates to \[\begin{aligned} {\mathbb{P}}[ M_{18}\leq 9] &= \sum_{k=0}^{9} {\mathbb{P}}[ M_{18} = k] = \sum_{k=0}^{9} ({\mathbb{P}}[X_{18}=k] + {\mathbb{P}}[ X_{18} = k+1])\\ & = {\mathbb{P}}[ X_{18} =0 ] + 2\, {\mathbb{P}}[X_{18} = 2] + 2\, {\mathbb{P}}[X_{18} = 4] + \dots \\ & \qquad \dots + 2\, {\mathbb{P}}[ X_{18} = 8] + {\mathbb{P}}[ X_{18} = 10] \\ &= 2^{-18}\left( \binom{18}{9} + 2 \binom{18}{10} + 2\binom{18}{11} + 2 \binom{18}{12} + 2 \binom{18}{13} + \binom{18}{14}\right) \end{aligned}\] By the way, you could have gotten a seemingly different answer. Since it is impossible to reach \(10\) in exactly \(19\) steps (the parity is wrong), the required probability is also equal to \[\begin{align} {\mathbb{P}}[ M_{19}\leq 9] &= \sum_{k=0}^9 \Big( {\mathbb{P}}[ X_{19} = k] + {\mathbb{P}}[ X_{19} = k+1] \Big) = 2^{-19} \times 2 \times \sum_{k\in\{1,3,5,7,9\}} \binom{19}{(19+k)/2}\\ &= 2^{-18} \times \left( \binom{19}{10} + \binom{19}{11} + \dots + \binom{19}{14} \right). \end{align}\]
Let \(X_n\), \(n\in{\mathbb{N}}_0\) be the number of heads minus the number of tails obtained so far. Then, \(\{X_n\}_{n\in {\mathbb{N}}_0}\) is a simple symmetric random walk, and we stop tossing the coin when \(X\) hits \(3\) for the first time. This will happen during the first 10 tosses, if and only if \(M_{10} \geq 3\), where \(M_n\) denotes the (running) maximum of \(X\). According to the reflection principle, \[\nonumber \begin{split} {\mathbb{P}}[M_{10} \geq 3]&= {\mathbb{P}}[ X_{10} \geq 3 ] + {\mathbb{P}}[ X_{10} \geq 4]\\ & = 2( {\mathbb{P}}[X_{10}= 4] +{\mathbb{P}}[X_{10}= 6] +{\mathbb{P}}[X_{10}= 8] +{\mathbb{P}}[X_{10}= 10])\\ &= 2^{-9} \left[ \binom{10}{3}+\binom{10}{2}+\binom{10}{1}+\binom{10}{0} \right] = {\frac{11}{32}}. \end{split}\]
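The three answers above can be evaluated numerically. Here is a minimal R sketch, assuming only the formula \({\mathbb{P}}[M_n=k]={\mathbb{P}}[X_n=k]+{\mathbb{P}}[X_n=k+1]\); the helper names `p_X` and `p_M` are ours and do not appear elsewhere in the notes.

```r
# P[X_n = k] for a simple symmetric random walk (0 when the parity is wrong)
p_X <- function(n, k) {
    ok <- (n + k) %% 2 == 0 & abs(k) <= n
    out <- numeric(length(k))
    out[ok] <- choose(n, (n + k[ok]) / 2) / 2^n
    out
}

# P[M_n = k] for k >= 0, where M_n is the running maximum
p_M <- function(n, k) p_X(n, k) + p_X(n, k + 1)

never_reach_5   <- sum(p_M(10, 0:4))      # P[M_10 <= 4]
at_least_20     <- sum(p_M(18, 0:9))      # P[M_18 <= 9]
at_least_20_alt <- sum(p_M(19, 0:9))      # P[M_19 <= 9], the same number
at_most_10      <- 1 - sum(p_M(10, 0:2))  # P[M_10 >= 3]

print(c(never_reach_5, at_least_20, at_least_20_alt, at_most_10))
```

The second and third numbers agree, which confirms that the two seemingly different expressions for the gambler's problem describe the same probability.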
Luke starts a random walk, where each step takes him to the left or to the right, with the two alternatives being equally likely and independent of the previous steps. \(11\) steps to his right is a cookie jar, and Luke gets to take a (single) cookie every time he reaches that position. He performs exactly \(15\) steps, and then stops.
What is the probability that Luke will be exactly by the cookie jar when he stops?
What is the probability that Luke stops with exactly \(3\) cookies in his hand?
What is the probability that Luke stops with at least one cookie in his hand?
Suppose now that we place a bowl of broccoli soup one step to the right of the cookie jar. It smells so bad that, if reached, Luke will throw away all the cookies he is currently carrying (if any) and run away pinching his nose. What is the probability that Luke will finish his \(15\)-step walk without ever encountering the yucky bowl of broccoli soup and with at least one cookie in his hand?
Let the position at time \(n\) be denoted by \(X_n\), so that \(\{X_n\}_{0\leq n \leq 15}\) is a simple symmetric random walk with the time horizon \(T=15\).
This is simply \({\mathbb{P}}[ X_{15} = 11] = \binom{15}{2} 2^{-15} = \binom{15}{13} 2^{-15}\).
The only way for Luke to return with \(3\) cookies is to go straight to \(11\), step away from it, return, step away from it and return again. There are exactly \(4\) paths that do that. They all start with \(11\) \(+1\)s (or "up"s or "right"s) and then continue in one of the following 4 ways \[(+1,-1,+1,-1), (+1,-1,-1,+1), (-1,+1,-1,+1) \text{ and } (-1,+1,+1,-1).\] Therefore, the probability is \(4/2^{15} = 2^{-13}\).
Luke will stop with at least one cookie in his hand, if and only if the maximal (i.e., right-most) position of his walk is \(11\) or above. Therefore, the required probability is \({\mathbb{P}}[ M_{15} \geq 11]\). Using the formula \({\mathbb{P}}[ M_T=k] = {\mathbb{P}}[X_T = k] + {\mathbb{P}}[ X_T = k+1]\) we get \[\begin{aligned} {\mathbb{P}}[ M_{15} \geq 11] &= {\mathbb{P}}[ M_{15} = 11] + {\mathbb{P}}[ M_{15} = 12] + \dots + {\mathbb{P}}[ M_{15}=15]\\ & = {\mathbb{P}}[ X_{15}=11] + 2 {\mathbb{P}}[ X_{15} = 13] + 2 {\mathbb{P}}[X_{15}=15] \\ &= 2^{-15}\Big( \binom{15}{2} + 2 \binom{15}{1} + 2\binom{15}{0}\Big). \end{aligned}\]
Here, we want Luke to reach the position \(11\) (to get a cookie), but not the position \(12\) (where the bowl of broccoli soup is). This corresponds to the maximum being exactly \(11\). By the formula \({\mathbb{P}}[ M_T = k] = {\mathbb{P}}[X_T=k] + {\mathbb{P}}[X_T = k+1]\), and because \(X_{15}\) can only take odd values, we get \[{\mathbb{P}}[ M_{15}=11] = {\mathbb{P}}[X_{15}=11] + {\mathbb{P}}[X_{15}=12] = {\mathbb{P}}[X_{15}=11] = \binom{15}{2}2^{-15}.\]
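Since there are only \(2^{15}=32768\) possible walks, all four answers can be checked by listing every path. The following R sketch is our own sanity check (the variable names are hypothetical), not part of the original solution:

```r
n <- 15

# All 2^15 equally likely step sequences, one per row
steps <- as.matrix(expand.grid(rep(list(c(-1, 1)), n)))

# Row-wise cumulative sums give the positions X_1, ..., X_15 of each path
paths <- t(apply(steps, 1, cumsum))

cookies <- rowSums(paths == 11)   # number of visits to the cookie jar
M <- apply(paths, 1, max)         # maximal (right-most) position reached

p_at_jar   <- mean(paths[, n] == 11)   # stops exactly by the jar
p_three    <- mean(cookies == 3)       # exactly 3 cookies
p_at_least <- mean(M >= 11)            # at least one cookie
p_no_soup  <- mean(M == 11)            # a cookie, but never the soup

print(c(p_at_jar, p_three, p_at_least, p_no_soup) * 2^n)
```

Multiplying by \(2^{15}\) turns the probabilities into path counts, which are easier to compare with the binomial coefficients in the solutions.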
Let \(C_n = \frac{1}{n+1}\binom{2n}{n}\) denote the \(n\)-th Catalan number, as defined at the end of the discussion of the Ballot problem above.
Use the reflection principle to show that \(C_n\) is the number of trajectories \((x_0,\dots, x_{2n})\) of a random walk with time horizon \(T=2n\) such that \(x_k \geq 0\), for all \(k\in\{0,1,\dots, 2n\}\) and \(x_{2n}=0\).
Prove Segner’s recurrence formula \(C_{n+1} = \sum_{i=0}^n C_{i} C_{n-i}\).
Show that \(C_n\) is the number of ways the vertices of a regular \(2n\)-gon can be paired so that the line segments joining paired vertices do not intersect.
Let \(\hat{C}_n\) be the number of all trajectories with \(x_0=0\) and \(x_{2n}=0\) such that \(x_k<0\) for some \(k\). This way, \(C_n = \binom{2n}{n} - \hat{C}_n\), since the total number of trajectories with \(x_0=x_{2n}=0\) is \(\binom{2n}{n}\). If we reflect each trajectory that contributes to \(\hat{C}_n\) around the level \(-1\) (from the first time it hits \(-1\) onwards), we will get a trajectory with \(x_0=0\) and \(x_{2n} = -2\). Conversely, each trajectory from \(x_0=0\) to \(x_{2n}=-2\) will cross the level \(-1\) at some point \(k\), and its reflection (around \(-1\)) will be a trajectory from \(x_0=0\) to \(x_{2n}=0\). Therefore, \(\hat{C}_n = \binom{2n}{n+1}\) because there are exactly \(\binom{2n}{n+1}\) trajectories with \(x_0=0\) and \(x_{2n}=-2\) (\(n+1\) down-steps and \(2n - (n+1) = n-1\) up-steps). Putting it all together, we get \[\begin{align} C_n &= \binom{2n}{n} - \binom{2n}{n+1} = \frac{(2n)!}{n! n!} - \frac{(2n)!}{ (n+1)! (n-1)!} = \frac{ (2n)!}{ (n-1)! n!} \Big( \frac{1}{n} - \frac{1}{n+1}\Big)\\ & = \frac{ (2n)!}{ (n-1)! n! n (n+1)} = \frac{1}{n+1} \binom{2n}{n}. \end{align}\]
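For small \(n\), the count can be confirmed by listing all \(2^{2n}\) trajectories and keeping those that end at \(0\) without dipping below it. This R sketch is only an illustration under that brute-force approach; the function names `catalan` and `count_nonneg_paths` are ours.

```r
# Closed-form Catalan number
catalan <- function(n) choose(2 * n, n) / (n + 1)

# Count trajectories of length 2n that end at 0 and never go below 0
count_nonneg_paths <- function(n) {
    steps <- as.matrix(expand.grid(rep(list(c(-1, 1)), 2 * n)))
    paths <- t(apply(steps, 1, cumsum))   # one trajectory per row
    sum(paths[, 2 * n] == 0 & apply(paths, 1, min) >= 0)
}

counts <- sapply(1:5, count_nonneg_paths)
print(rbind(brute_force = counts, formula = catalan(1:5)))
```

The enumeration grows like \(4^n\), so it is feasible only for small \(n\); the reflection argument above is what handles the general case.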
Let \(A\) denote the set of all trajectories of \(T=2n+2\) steps with \(x_0=x_{2n+2}=0\) that never dip below zero, so that \(C_{n+1} = \# A\) (the number of elements in \(A\)). We can split \(A\) into several subclasses according to the first time \(0\) is re-visited after \(x_0=0\). More precisely, let \(A_m\) denote the set of all trajectories in \(A\) such that \(x_{m}=0\), but \(x_k>0\) for \(0<k<m\). Since you can be back at \(0\) only at even times, the sets \(A_m\) with \(m\) odd are empty. Therefore, we can focus on \(A_{2i}\) for \(i=1,2, \dots, n+1\). As shown on the plot below (where \(T=20\) and \(2i =10\)) each trajectory in \(A_{2i}\) must look like this: the first step is necessarily up (blue segment). After that the trajectory stays at or above the level \(1\) until time \(2i-1\), when it must be exactly at \(1\) (green segments). The next step must be down (blue, to make good on the promise that \(x_{2i}=0\)), and after that, the trajectory (red) is free to do whatever it wants (subject to remaining in the set \(A\)).
The green portion of the graph is itself a trajectory that starts at the same level where it ends, and never goes below it. Its size is \(2(i-1)\), so the number of possible green sub-trajectories is \(C_{i-1}\). After that, the red trajectory is just like any other trajectory in \(A\), but of length \(2(n+1)-2i\), so the number of possible red trajectories is \(C_{n+1-i}.\) Hence, the number of trajectories in \(A_{2i}\) is \(C_{i-1} C_{n+1-i}.\) Since \(A\) is made up of \(A_2, A_4, \dots, A_{2(n+1)}\), we have \[ \#A = \sum_{i=1}^{n+1} \# A_{2i} = \sum_{i=1}^{n+1} C_{i-1} C_{n+1-i} = \sum_{i=0}^{n} C_{i} C_{n-i}. \] Note that we used tacitly that \(C_0=1\).
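Segner’s recurrence also gives a convenient way to compute Catalan numbers one after another. The short R sketch below (our own illustration; the cutoff `n_max` is arbitrary) builds \(C_0,\dots,C_{10}\) from \(C_0=1\) and compares them with the closed-form expression:

```r
n_max <- 10
C <- numeric(n_max + 1)   # C[k + 1] stores C_k, because R indexes from 1
C[1] <- 1                 # C_0 = 1

for (n in 0:(n_max - 1)) {
    i <- 0:n
    # C_{n+1} = sum over i of C_i * C_{n-i}
    C[n + 2] <- sum(C[i + 1] * C[n - i + 1])
}

closed_form <- choose(2 * (0:n_max), 0:n_max) / ((0:n_max) + 1)
print(rbind(recurrence = C, closed_form = closed_form))
```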
Let the number of pairings without intersections (as in the problem) be denoted by \(B_n\). The goal is to show that \(B_n = C_n\), where \(C_n\) is the \(n\)-th Catalan number.
We start by picking a vertex of the regular \(2n\)-gon and calling it \(V_1\). Starting from \(V_1\) and going clockwise, we name the other vertices \(V_2,V_3,\dots, V_{2n}\). The set of all pairings without intersections, as in the problem, can be split into subclasses according to which vertex is paired with \(V_1\). More precisely, let \(A_m\) be the set of all pairings in which \(V_1\) is paired with \(V_m\). We first note that no pairing can exist in which \(m\) is odd (why?), so we can write \(m=2i\).
Each pairing in \(A_{2i}\) splits the set of \(2n\) points into three classes: the points \(V_2, V_3, \dots, V_{2i-1}\) “to the left” of the segment \(V_1 V_{2i}\) (there are \(2i-2\) of those), the segment \(V_1 V_{2i}\) itself (blue), and the set of remaining points \(V_{2i+1}, \dots, V_{2n}\) “to the right” (there are \(2n - 2i\) of those). No pairing can use one point “to the left” and one point “to the right”, because the segment joining them would necessarily intersect \(V_1 V_{2i}\). Hence, the problem is reduced to two smaller, but similar, problems. We can pair the points “to the left” in \(B_{i-1}\) ways (the red segments), and those “to the right” in \(B_{n-i}\) ways (green segments), resulting in exactly \(B_{i-1} B_{n-i}\) possible pairings when \(V_1\) is paired with \(V_{2i}\). It remains to sum over \(i\) to obtain \(B_n = \sum_{i=1}^{n} B_{i-1} B_{n-i}\), or, equivalently, \[ B_{n+1} = \sum_{i=0}^{n} B_{i} B_{n-i}.\]
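In other words, the numbers \(B_n\) satisfy the same recursive equation as the Catalan numbers \(C_n\). To conclude that \(B_n = C_n\) for all \(n\), it is enough to check that the two sequences start from the same value: \(B_0 = C_0 = 1\), where \(B_0=1\) is the convention (already used tacitly in the sum above) that there is exactly one way to pair up zero points. As a sanity check, the formula for the \(n\)-th Catalan number gives \(C_2 = 2\), and \(B_2\) is also \(2\) because there are two ways to pair the vertices of a square without intersections.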
Let \(\{X_n\}_{n\in {\mathbb{N}}_0}\) be a simple symmetric random walk. Given \(n\in{\mathbb{N}}\), what is the probability that \(X\) does not visit \(0\) during the time interval \(1,\dots, n\)?
Let us denote the required probability by \(p_n\), i.e., \[p_n={\mathbb{P}}[ X_1\not= 0, X_2\not=0, \dots, X_n\not=0].\] For \(n=1\), \(p_1=1\), since \(X_1\) is either \(1\) or \(-1\). For \(n>1\), let \(\delta_1\) be the first increment \(\delta_1=X_1-X_0=X_1\). If \(\delta_1=-1\), we need to compute the probability that a random walk of length \(n-1\), starting at \(-1\), does not hit \(0\). This probability is, in turn, the same as the probability that a random walk of length \(n-1\), starting from \(0\), never hits \(1\). By the symmetry of the increments, the same reasoning works for the case \(\delta_1=1\). Therefore, \[\begin{align} p_n &= {\mathbb{P}}[ X_1\leq 0, X_2\leq 0, \dots, X_{n-1}\leq 0]\, {\mathbb{P}}[\delta_1=-1]\\ & \quad+ {\mathbb{P}}[ X_1\geq 0, X_2\geq 0, \dots, X_{n-1}\geq 0]\, {\mathbb{P}}[\delta_1=1]\\ &=\tfrac{1}{2}{\mathbb{P}}[ M_{n-1}=0] + \tfrac{1}{2}{\mathbb{P}}[M_{n-1}=0] = {\mathbb{P}}[ M_{n-1}=0], \end{align}\] where \(M_n=\max\{X_0,\dots, X_n\}\). Using the formula from the notes, this probability is given by \[p_n= 2^{-n+1} \binom{n-1}{ \lfloor n/2 \rfloor},\] where \(\lfloor x \rfloor\) denotes the largest integer \(\leq x\).
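The formula for \(p_n\) is easy to test against a direct enumeration for small \(n\). The R sketch below is our own check (the names `p_formula` and `p_brute` are hypothetical); it lists all \(2^n\) paths and computes the fraction that avoids \(0\) at times \(1,\dots,n\).

```r
# Closed-form expression for p_n
p_formula <- function(n) choose(n - 1, floor(n / 2)) / 2^(n - 1)

# Brute-force value of p_n: fraction of the 2^n paths that never visit 0
p_brute <- function(n) {
    steps <- as.matrix(expand.grid(rep(list(c(-1, 1)), n)))
    # positions X_1, ..., X_n, one path per row (matrix() also handles n = 1)
    paths <- matrix(t(apply(steps, 1, cumsum)), ncol = n)
    mean(rowSums(paths == 0) == 0)
}

ns <- 1:8
print(rbind(formula = p_formula(ns), brute_force = sapply(ns, p_brute)))
```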
Let \(\tau_{-1}\) be the hitting time of the level \({-1}\) for a simple biased random walk \(\{X_n\}_{n\in {\mathbb{N}}_0}\). Choose the correct answer(s) (they will depend on the value of the parameter \(p\)):
Hitting the level \(-1\) for a biased random walk with parameter \(p\) is equivalent to hitting the level \(1\) for a biased random walk with parameter \(1-p\). The correct answers are: a and d for \(p<1/2\), c for \(p=1/2\), and b for \(p>1/2\).
Let \(\tau\) and \(\tilde{\tau}\) be two stopping times. Which of the following are necessarily stopping times, as well:
Either one of the following \(4\) random times is not a stopping time for a simple random walk \(\{X_n\}_{n\in {\mathbb{N}}_0}\), or they all are. Choose the one which is not in the first case, or choose e. if you think they all are.
The correct answer is e. The first, second, third, or … hitting times of a level are stopping times, and so are their minima or maxima. Note that for two stopping times \(\tau_1\) and \(\tau_2\), the one that happens first is \(\min(\tau_1,\tau_2)\) and the one that happens last is \(\max(\tau_1,\tau_2)\).
At most one of the following \(4\) random times is a stopping time for a simple random walk \(\{X_n\}_{n\in {\mathbb{N}}_0}\). Either choose the one which you think is a stopping time, or choose e. if you think there are no stopping times among them.
The correct answer is e. For each of the random times in a.-d. you need to know something about the future to tell whether they happened when they happened. For example, for \(c.\), you have no way of knowing (in general) whether or not \(X_{2} - X_1\) equals \(X_3\) at time \(2\).
The purpose of this problem is to understand how long we have to wait until a simple symmetric random walk hits the level \(1\). The theory presented so far guarantees that this will happen sooner or later, but it gives no indication of the length of the wait. As usual, we denote by \(\tau_1\) the (random) first time the SSRW \(\{X_n\}_{n\in{\mathbb{N}}_0}\) hits the level \(1\).
Write an R function that simulates a trajectory of a random walk, but only until the first time it hits level \(1\). You don’t have to record the trajectory itself - just keep tossing coins until the trajectory hits \(1\) and return the number of steps needed. Your function needs to accept an argument, T, such that your simulation stops if \(1\) has not been reached in the first T steps.
Pick a large-ish value of the parameter T (say \(100\)) and replicate the simulation from 1. above sufficiently many times (say \(10,000\)). Draw a histogram of your results.
Repeat the simulation for the following values of \(T\): \(500\), \(1,000\), \(10,000\), \(50,000\), \(100,000\), and compute the mean and the standard deviation of your simulations. Display your results in two tables. Are these numbers underestimates or overestimates of \({\mathbb{E}}[\tau_1]\) and \(\operatorname{Var}[\tau_1]\)? Explain why.
(Note: Decrease the number nsim of simulations to \(1000\) or even \(100\) if \(10,000\) is taking too long.)
Repeat all of the above, but for the first time the absolute value of your random walk reaches level \(5\). What is the most glaring difference between the two cases? What does that mean for the amount of time you are going to have to wait to hit \(1\), vs. for the absolute value to hit \(5\)? More precisely, what do you think their means and standard deviations are?
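simulate_tau = function(T) {
    # walk until level 1 is hit, or until T steps have been taken
    X = 0
    for (n in 1:T) {
        X = X + sample(c(-1, 1), size = 1)
        if (X == 1)
            break
    }
    # n is the hitting time capped at T, i.e., min(tau_1, T)
    return(n)
}
nsim = 10000
T = 100
tau = replicate(nsim, simulate_tau(T))
hist(tau, probability = TRUE)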
T = c(500, 1000, 10000, 50000, 100000)
Mean = numeric(5)
StDev = numeric(5)
nsim = 1000
for (i in 1:5) {
    tau = replicate(nsim, simulate_tau(T[i]))
    Mean[i] = mean(tau)
    StDev[i] = sd(tau)
}
df = data.frame(T, Mean, StDev)
options(scipen = 50) # no scientific (e) notation
print(df)
## T Mean StDev
## 1 500 31 98
## 2 1000 43 158
## 3 10000 196 1156
## 4 50000 264 2865
## 5 100000 427 5077
simulate_tau_abs = function(T) {
    # walk until the absolute value hits 5, or until T steps have been taken
    X = 0
    for (n in 1:T) {
        X = X + sample(c(-1, 1), size = 1)
        if (abs(X) == 5)
            break
    }
    # n is the hitting time capped at T
    return(n)
}
nsim = 10000
T = 100
tau_abs = replicate(nsim, simulate_tau_abs(T))
hist(tau_abs, probability = TRUE)
T = c(500, 1000, 10000, 50000, 100000)
Mean = numeric(5)
StDev = numeric(5)
nsim = 10000
for (i in 1:5) {
    tau_abs = replicate(nsim, simulate_tau_abs(T[i]))
    Mean[i] = mean(tau_abs)
    StDev[i] = sd(tau_abs)
}
df = data.frame(T, Mean, StDev)
options(scipen = 50) # no scientific (e) notation
print(df)
## T Mean StDev
## 1 500 25 20
## 2 1000 25 20
## 3 10000 25 20
## 4 50000 25 20
## 5 100000 25 20
The most glaring difference between the two tables is that the mean and standard-deviation estimates seem to grow with \(T\) in the first case, but not in the second. Since the simulation records \(\min(\tau_1,T)\leq \tau_1\) rather than \(\tau_1\) itself, the means in the first table underestimate \({\mathbb{E}}[\tau_1]\). Their growth suggests that the random variable \(\tau_1\) takes such large values that no “cap” \(T\) can “contain them”. More precisely, the random variable \(\tau_1\) has infinite expectation (and also infinite standard deviation). Indeed, if its expectation were finite, the values in the “Mean” column would stabilize towards it as \(T\) grows. Since they don’t, this expectation is infinite. The same goes for the standard deviation. The moral of the story is that even though simple symmetric random walks hit every level eventually, you may have to wait a long time for that to happen.
This does not happen for tau_abs. Indeed, it can be shown that both its expectation and standard deviation are finite (the table suggests values of about \(25\) and \(20\), respectively). The time you are going to wait until you hit either \(-5\) or \(5\) is much shorter “on average” than the time needed to hit \(1\).
Simply put, a stochastic process has the Markov property if probabilities governing its future evolution depend only on its current position, and not on how it got there. Here is a more precise, mathematical, definition. It will be assumed throughout this course that any stochastic process \(\{X_n\}_{n\in {\mathbb{N}}_0}\) takes values in a countable set \(S\) called the state space. \(S\) will always be either finite or countably infinite, and a generic element of \(S\) will be denoted by \(i\) or \(j\).
A stochastic process \(\{X_n\}_{n\in {\mathbb{N}}_0}\) taking values in a countable state space \(S\) is called a Markov chain if \[\begin{equation} {\mathbb{P}}[ X_{n+1}=j|X_n=i, X_{n-1}=i_{n-1},\dots, X_1=i_1, X_0=i_0]= {\mathbb{P}}[ X_{n+1}=j|X_n=i], (\#eq:markov) \end{equation}\] for all times \(n\in{\mathbb{N}}_0\), all states \(i,j,i_0, i_1, \dots, i_{n-1} \in S\), whenever the two conditional probabilities are well-defined, i.e., when \[\begin{equation} {\mathbb{P}}[ X_n=i, \dots, X_1=i_1, X_0=i_0]>0. (\#eq:markov-well-defined) \end{equation}\]
The Markov property is typically checked in the following way: one computes the left-hand side of @ref(eq:markov) and shows that its value does not depend on \(i_{n-1},i_{n-2}, \dots, i_1, i_0\) (why is that enough?). The condition @ref(eq:markov-well-defined) will be assumed (without explicit mention) every time we write a conditional expression like the one in @ref(eq:markov).
All chains in this course will be homogeneous, i.e., the conditional probabilities \({\mathbb{P}}[X_{n+1}=j|X_{n}=i]\) will not depend on the current time \(n\in{\mathbb{N}}_0\), i.e., \({\mathbb{P}}[X_{n+1}=j|X_{n}=i]={\mathbb{P}}[X_{m+1}=j|X_{m}=i]\), for \(m,n\in{\mathbb{N}}_0\).
Markov chains are (relatively) easy to work with because the Markov property allows us to compute all the probabilities, expectations, etc. we might be interested in by using only two ingredients.
The initial distribution: \({a}^{(0)}= \{ {a}^{(0)}_i\, : \, i\in S\}\), \({a}^{(0)}_i={\mathbb{P}}[X_0=i]\) - the initial probability distribution of the process, and
Transition probabilities: \(p_{ij}={\mathbb{P}}[X_{n+1}=j|X_n=i]\) - the mechanism that the process uses to jump around.
Indeed, if you know \({a}^{(0)}_i\) and \(p_{ij}\) for all \(i,j\in S\) and want to compute a joint distribution \({\mathbb{P}}[X_n=i_n, X_{n-1}=i_{n-1}, \dots, X_0=i_0]\), you can use the definition of conditional probability and the Markov property several times (the multiplication theorem from your elementary probability course) as follows: \[\begin{align} {\mathbb{P}}[X_n=i_n, \dots, X_0=i_0] &= {\mathbb{P}}[X_n=i_n| X_{n-1}=i_{n-1}, \dots,X_0=i_0] \cdot {\mathbb{P}}[X_{n-1}=i_{n-1}, \dots,X_0=i_0] \\ & = {\mathbb{P}}[X_n=i_n| X_{n-1}=i_{n-1}] \cdot {\mathbb{P}}[X_{n-1}=i_{n-1}, \dots,X_0=i_0]\\ &= p_{i_{n-1} i_{n}} {\mathbb{P}}[X_{n-1}=i_{n-1}, \dots,X_0=i_0] \end{align}\] If we repeat the same procedure \(n-1\) more times (and flip the order of factors), we get \[\begin{align} {\mathbb{P}}[X_n=i_n, \dots, X_0=i_0] &= {a}^{(0)}_{i_0} \cdot p_{i_0 i_1} \cdot p_{i_1 i_2}\cdot \ldots \cdot p_{i_{n-1} i_{n}} \end{align}\] Think of it this way: the probability of the process taking the trajectory \((i_0, i_1, \dots, i_n)\) is the probability of starting at \(i_0\), multiplied by the probability of jumping from \(i_0\) to \(i_1\), then from \(i_1\) to \(i_2\), and so on, up to the jump from \(i_{n-1}\) to \(i_n\).
When \(S\) is finite, there is no loss of generality in assuming that \(S=\{1,2,\dots, n\}\), and then we usually organize the entries of \({a}^{(0)}\) into a row vector \[{a}^{(0)}=({a}^{(0)}_1,{a}^{(0)}_2,\dots, {a}^{(0)}_n),\] and the transition probabilities \(p_{ij}\) into a square matrix \({\mathbf P}\), where \[{\mathbf P}=\begin{pmatrix} p_{11} & p_{12} & \dots & p_{1n} \\ p_{21} & p_{22} & \dots & p_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ p_{n1} & p_{n2} & \dots & p_{nn} \\ \end{pmatrix}\] In the general case (\(S\) possibly infinite), one can still use the vector and matrix notation as before, but it becomes quite clumsy. For example, if \(S={\mathbb{Z}}\), then \({\mathbf P}\) is an infinite matrix \[{\mathbf P}=\begin{pmatrix} \ddots & \vdots & \vdots & \vdots & \\ \dots & p_{-1\, -1} & p_{-1\, 0} & p_{-1\, 1} & \dots \\ \dots & p_{0\, -1} & p_{0\, 0} & p_{0\, 1} & \dots \\ \dots & p_{1\, -1} & p_{1\, 0} & p_{1\, 1} & \dots \\ & \vdots & \vdots & \vdots & \ddots \\ \end{pmatrix}\]
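With \({a}^{(0)}\) stored as a vector and \({\mathbf P}\) as a matrix, the probability of a trajectory is a one-line product. Here is a minimal R sketch; the function name `traj_prob` and the two-state chain used to try it out are our own, made-up, illustrations.

```r
# Probability of the trajectory (i_0, ..., i_n):
# a0[i_0] * P[i_0, i_1] * ... * P[i_{n-1}, i_n]
traj_prob <- function(traj, a0, P) {
    from <- traj[-length(traj)]   # i_0, ..., i_{n-1}
    to <- traj[-1]                # i_1, ..., i_n
    a0[traj[1]] * prod(P[cbind(from, to)])
}

# A hypothetical two-state chain on S = {1, 2}
P <- matrix(c(0.9, 0.1,
              0.4, 0.6), nrow = 2, byrow = TRUE)
a0 <- c(0.5, 0.5)

traj_prob(c(1, 1, 2, 2), a0, P)   # 0.5 * 0.9 * 0.1 * 0.6
```

Indexing `P` by the two-column matrix `cbind(from, to)` picks out the entries \(p_{i_0 i_1}, \dots, p_{i_{n-1} i_n}\) in one go, so no explicit loop is needed.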
Here are some examples of Markov chains - you will see many more in problems and later chapters. Markov chains with a small number of states are often depicted as weighted directed graphs, whose nodes are the chain’s states, and the weight of the directed edge between \(i\) and \(j\) is \(p_{ij}\). Such graphs are called transition graphs and are an excellent way to visualize a number of important properties of the chain. A transition graph is included for most of the examples below. Edges are color-coded according to the probability assigned to them. Black is always \(1\), while other colors are uniquely assigned to different probabilities (edges carrying the same probability get the same color).
Let \(\{X_n\}_{n\in {\mathbb{N}}_0}\) be a simple (possibly biased) random walk. Let us show that it indeed has the Markov property @ref(eq:markov). Remember, first, that \[X_n=\sum_{k=1}^n \delta_k \text{ where }\delta_k \text{ are independent (possibly biased) coin-tosses.}\] For a choice of \(i_0, \dots, i_n, j=i_{n+1}\) (such that \(i_0=0\) and \(i_{k+1}-i_{k}=\pm 1\)) we have \[\begin{split} {\mathbb{P}}[ X_{n+1}=i_{n+1}&|X_n=i_n, X_{n-1}=i_{n-1},\dots, X_1=i_1, X_0=i_0]\\ = & {\mathbb{P}}[ X_{n+1}-X_{n}=i_{n+1}-i_{n}|X_n=i_n, X_{n-1}=i_{n-1},\dots, X_1=i_1, X_0=i_0] \\ = & {\mathbb{P}}[ \delta_{n+1}=i_{n+1}-i_{n}|X_n=i_n, X_{n-1}=i_{n-1},\dots, X_1=i_1, X_0=i_0] \\= & {\mathbb{P}}[ \delta_{n+1}=i_{n+1}-i_n], \end{split}\]
where the last equality follows from the fact that the increment \(\delta_{n+1}\) is independent of the previous increments, and, therefore, also of the values of \(X_1,X_2, \dots, X_n\). The last line above does not depend on \(i_{n-1}, \dots, i_1, i_0\), so \(X\) indeed has the Markov property.
The state space \(S\) of \(\{X_n\}_{n\in {\mathbb{N}}_0}\) is the set \({\mathbb{Z}}\) of all integers, and the initial distribution \({a}^{(0)}\) is very simple: we start at \(0\) with probability \(1\) (so that \({a}^{(0)}_0=1\) and \({a}^{(0)}_i=0\), for \(i\not= 0\)). The transition probabilities are simple to write down: with \(q=1-p\), \[p_{ij}= \begin{cases} p, & j=i+1 \\ q, & j=i-1 \\ 0, & \text{otherwise.} \end{cases}\] If you insist, these can be written down in an infinite matrix, \[{\mathbf P}=\begin{pmatrix} \ddots & \vdots & \vdots & \vdots & \vdots & \vdots & \\ \dots & 0 & p & 0 & 0 & 0 & \dots \\ \dots & q & 0 & p & 0 & 0 & \dots \\ \dots &0 &q & 0 & p & 0 & \dots \\ \dots &0 &0 & q& 0 & p& \dots \\ \dots &0 & 0 &0 & q& 0& \dots \\ \dots &0 & 0 &0 & 0& q& \dots \\ & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots \\ \end{pmatrix}\] but this representation is typically not as useful as in the finite case.
Here is (a portion of) a transition graph for a simple random walk. Instead of writing probabilities on top of the edges, we color code them as follows: green is \(p\) and orange is \(1-p\).
In Gambler’s ruin, a gambler starts with \(\$x\), where \(0\leq x \leq a\in{\mathbb{N}}\), and in each play either wins a dollar (with probability \(p\in (0,1)\)) or loses a dollar (with probability \(q=1-p\)). When the gambler reaches either \(0\) or \(a\), the game stops. For mathematical convenience, it is usually a good idea to keep the chain defined, even after the modeled phenomenon stops. This is usually accomplished by simply assuming that the process “stays alive” but remains “frozen in place” instead of disappearing. In our case, once the gambler reaches either of the states \(0\) and \(a\), he/she simply stays there forever.
Therefore, the transition probabilities are similar to those of a random walk, but differ from them at the boundaries \(0\) and \(a\). The state space is finite \(S=\{0,1,\dots, a\}\) and the matrix \({\mathbf P}\) is given by \[{\mathbf P}=\begin{pmatrix} 1 & 0 & 0 & 0 & \dots & 0 & 0 & 0 \\ q & 0 & p & 0 & \dots & 0 & 0 & 0 \\ 0 & q & 0 & p & \dots & 0 & 0 & 0 \\ 0 & 0 & q & 0 & \dots & 0 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \dots & 0 & p & 0 \\ 0 & 0 & 0 & 0 & \dots & q & 0 & p \\ 0 & 0 & 0 & 0 & \dots & 0 & 0 & 1 \\ \end{pmatrix}\]
In the picture below, green denotes the probability \(p\) and orange \(1-p\). As always, black is \(1\).
The initial distribution is deterministic: \[{a}^{(0)}_i= \begin{cases} 1,& i=x,\\ 0,& i\not= x. \end{cases}\]
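For concrete values of \(a\) and \(p\), the matrix \({\mathbf P}\) above is easy to build in R. The sketch below is one possible way to do it; the function name `gamblers_ruin_matrix` and the values \(a=4\), \(p=0.6\) are our own choices for illustration. Note that state \(i\) lives in row and column \(i+1\), because R indexes from \(1\).

```r
gamblers_ruin_matrix <- function(a, p) {
    q <- 1 - p
    P <- matrix(0, nrow = a + 1, ncol = a + 1,
                dimnames = list(0:a, 0:a))
    P[1, 1] <- 1           # state 0 is absorbing (bankruptcy)
    P[a + 1, a + 1] <- 1   # state a is absorbing (target reached)
    for (i in seq_len(a - 1)) {   # interior states 1, ..., a - 1
        P[i + 1, i] <- q          # lose a dollar: i -> i - 1
        P[i + 1, i + 2] <- p      # win a dollar:  i -> i + 1
    }
    P
}

P <- gamblers_ruin_matrix(a = 4, p = 0.6)
print(P)
rowSums(P)   # every row of a transition matrix must sum to 1
```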
Consider a system with two different states; think about a simple weather forecast (rain/no rain), high/low water level in a reservoir, high/low volatility regime in a financial market, high/low level of economic growth, the political party in power, etc. Suppose that the states are called \(1\) and \(2\) and the probabilities \(p_{12}\) and \(p_{21}\) of switching states are given. The probabilities \(p_{11}=1-p_{12}\) and \(p_{22}=1-p_{21}\) correspond to the system staying in the same state. The transition matrix for this Markov chain with \(S=\{1,2\}\) is \[{\mathbf P}= \begin{pmatrix} p_{11} & p_{12} \\ p_{21} & p_{22} \end{pmatrix}.\] When \(p_{12}\) and \(p_{21}\) are large (close to \(1\)) the system nervously jumps between the two states. When they are small, there are long periods of stability (staying in the same state).
One of the assumptions behind regime-switching models is that the transitions (switches) can only happen in regular intervals (once a minute, once a day, once a year, etc.). This is a feature of all discrete-time Markov chains. One would need to use a continuous-time model to allow for the transitions between states at any point in time.
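To get a feel for how \(p_{12}\) and \(p_{21}\) affect the behavior of a regime-switching chain, one can simulate it. The following R sketch assumes a generic simulator of our own, `simulate_chain`, and made-up switching probabilities; neither comes from the notes.

```r
# Simulate n steps of a chain with transition matrix P, started at x0
simulate_chain <- function(P, x0, n) {
    X <- integer(n + 1)
    X[1] <- as.integer(x0)
    for (k in seq_len(n)) {
        # the next state is drawn from row X[k] of the transition matrix
        X[k + 1] <- sample.int(nrow(P), size = 1, prob = P[X[k], ])
    }
    X
}

# Hypothetical switching probabilities, chosen only for illustration
p12 <- 0.1
p21 <- 0.3
P <- matrix(c(1 - p12, p12,
              p21, 1 - p21), nrow = 2, byrow = TRUE)

set.seed(1)
X <- simulate_chain(P, x0 = 1, n = 1000)
mean(X[-1] != X[-length(X)])   # observed fraction of steps with a switch
```

Rerunning the simulation with \(p_{12}\) and \(p_{21}\) close to \(1\) should produce a trajectory that switches at almost every step, while small values should give long stretches spent in the same state.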
A stochastic process \(\{X_n\}_{n\in {\mathbb{N}}_0}\) with state space \(S={\mathbb{N}}_0\) such that \(X_n=n\) for \(n\in{\mathbb{N}}_0\) (no randomness here) is called the deterministically monotone Markov chain (DMMC). The transition matrix looks like this \[{\mathbf P}= \begin{pmatrix} 0 & 1 & 0 & 0 & \dots \\ 0 & 0 & 1 & 0 & \dots \\ 0 & 0 & 0 & 1 & \dots \\ \vdots & \vdots & \vdots & \vdots & \ddots \end{pmatrix}\]
and the transition graph like this:
It is a pretty boring chain; its main use is as a counterexample.
Consider a frog jumping from a lily pad to a lily pad in a small forest pond. Suppose that there are \(N\) lily pads so that the state space can be described as \(S=\{1,2,\dots, N\}\). The frog starts on lily pad 1 at time \(n=0\), and jumps around in the following fashion: at time \(0\) it chooses any lily pad except for the one it is currently sitting on (with equal probability) and then jumps to it. At time \(n>0\), it chooses any lily pad other than the one it is sitting on and the one it visited immediately before (with equal probability) and jumps to it. The position \(\{X_n\}_{n\in {\mathbb{N}}_0}\) of the frog is not a Markov chain. Indeed, we have \[{\mathbb{P}}[X_3=1|X_2=2, X_1=3]= \frac{1}{N-2},\] while \[{\mathbb{P}}[X_3=1|X_2=2, X_1=1]=0.\]
A more dramatic version of this example would be the one where the frog remembers all the lily pads it had visited before, and only chooses among the remaining ones for the next jump.
How can we turn the process from the previous example into a Markov chain? Obviously, the problem is that the frog has to remember the number of the lily pad it came from in order to decide where to jump next. The way out is to make this information a part of the state. In other words, we need to change the state space. Instead of just \(S=\{1,2,\dots, N\}\), we set \(S= \{ (i_1, i_2)\, : \, i_1,i_2 \in\{1,2,\dots, N\}\}\). In words, the state of the process will now contain not only the number of the current lily pad (i.e., \(i_2\)) but also the number of the lily pad we came from (i.e., \(i_1\)). This way, the frog will be in the state \((i_1,i_2)\) if it is currently on the lily pad number \(i_2\), and it arrived there from \(i_1\). There is a bit of freedom with the initial state, but we simply assume that we start from \((1,1)\). Starting from the state \((i_1,i_2)\), the frog can jump to any state of the form \((i_2, i_3)\), \(i_3\not= i_1,i_2\) (with equal probabilities). Note that some states will never be visited (like \((i,i)\) for \(i\not = 1\)), so we could have reduced the state space a little bit right from the start.
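To make the enlarged state space concrete, here is a sketch in R of the transition matrix on pairs \((i_1,i_2)\) for a small pond. The choice \(N=4\) and the string labels such as `"(3,2)"` are our own conventions, used only for this illustration.

```r
N <- 4
states <- expand.grid(prev = 1:N, cur = 1:N)   # all pairs (i1, i2)
label <- paste0("(", states$prev, ",", states$cur, ")")

P <- matrix(0, nrow(states), nrow(states), dimnames = list(label, label))
for (s in seq_len(nrow(states))) {
    i1 <- states$prev[s]
    i2 <- states$cur[s]
    allowed <- setdiff(1:N, c(i1, i2))            # pads the frog may jump to
    targets <- paste0("(", i2, ",", allowed, ")") # new states (i2, i3)
    P[s, targets] <- 1 / length(allowed)
}

P["(3,2)", "(2,1)"]   # frog on pad 2, came from 3: may jump to 1
P["(1,2)", "(2,1)"]   # frog on pad 2, came from 1: may not jump back to 1
```

The two entries printed at the end correspond to the two conditional probabilities \(1/(N-2)\) and \(0\) computed in the previous example; in the new chain they are simply transition probabilities out of two different states.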
It is important to stress that the passage to the new state space defines a whole new stochastic process. It is, therefore, not quite accurate to say, as the title suggests, that we “turned” a non-Markov process into a Markov process. Rather, we replaced a non-Markovian model of a given situation by a different, Markovian, one.
Let \(\{X_n\}_{n\in {\mathbb{N}}_0}\) be a Markov chain on the state space \(S\), and let \(f:S\to T\) be a function. The stochastic process \(Y_n= f(X_n)\) takes values in \(T\); is it necessarily a Markov chain?
We will see in this example that the answer is no. Let \(\{X_n\}_{n\in {\mathbb{N}}_0}\) be a simple symmetric random walk, with the usual state space \(S = {\mathbb{Z}}\). With \(r(m) = m\ (\text{mod } 3)\) denoting the remainder after the division by \(3\), we first define the process \(R_n = r(X_n)\) so that \[R_n=\begin{cases} 0, & \text{ if $X_n$ is divisible by 3,}\\ 1, & \text{ if $X_n-1$ is divisible by 3,}\\ 2, & \text{ if $X_n-2$ is divisible by 3.} \end{cases}\] Using \(R_n\) we define \(Y_n = (X_n-R_n)/3\) to be the corresponding quotient, so that \(Y_n\in{\mathbb{Z}}\) and \[3 Y_n \leq X_n <3 (Y_n+1).\] The process \(Y\) is of the form \(Y_n = f(X_n)\), where \(f(i)= \lfloor i/3 \rfloor\), and \(\lfloor x \rfloor\) is the largest integer not exceeding \(x\).
To show that \(Y\) is not a Markov chain, let us consider the event \(A=\{Y_2=0, Y_1=0\}\). The only way for this to happen is if \(X_1=1\) and \(X_2=2\) or \(X_1=1\) and \(X_2=0\), so that \(A=\{X_1=1\}\). Also \(Y_3=1\) if and only if \(X_3=3\). Therefore \[{\mathbb{P}}[ Y_3=1|Y_2=0, Y_1=0]={\mathbb{P}}[ X_3=3| X_1=1]= 1/4.\] On the other hand, \(Y_2=0\) if and only if \(X_2=0\) or \(X_2=2\), so \({\mathbb{P}}[Y_2=0]= 3/4\). Finally, \(Y_3=1\) and \(Y_2=0\) if and only if \(X_3=3\) and so \({\mathbb{P}}[Y_3=1, Y_2=0]= 1/8\). Hence, \[{\mathbb{P}}[ Y_3=1|Y_2=0]={\mathbb{P}}[Y_3=1, Y_2=0]/{\mathbb{P}}[Y_2=0]= \frac{1/8}{3/4}= \frac{1}{6}.\] Since the two conditional probabilities differ, \(Y\) is not a Markov chain. If you want a more dramatic example, try to modify this example so that one of the probabilities above is positive, but the other is zero.
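The two conditional probabilities above can be double-checked by brute force, since only the \(2^3=8\) equally likely step sequences up to time \(3\) matter. The following is a minimal sketch; the helper name `cond_prob` is ours and not part of the notes.

```r
# all 2^3 equally likely step sequences of the simple symmetric random walk
steps = as.matrix(expand.grid(s1 = c(-1, 1), s2 = c(-1, 1), s3 = c(-1, 1)))
# positions X_1, X_2, X_3 along each path (X_0 = 0); one row per path
X = t(apply(steps, 1, cumsum))
# Y_n = floor(X_n / 3)
Y = floor(X / 3)

# conditional probability computed by counting equally likely paths
# (hypothetical helper, introduced only for this illustration)
cond_prob = function(event, given) sum(event & given) / sum(given)

p_long_history = cond_prob(Y[, 3] == 1, Y[, 2] == 0 & Y[, 1] == 0)
p_short_history = cond_prob(Y[, 3] == 1, Y[, 2] == 0)
c(p_long_history, p_short_history)
```

The two numbers returned correspond to \({\mathbb{P}}[Y_3=1|Y_2=0,Y_1=0]\) and \({\mathbb{P}}[Y_3=1|Y_2=0]\), in that order.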
The important property of the function \(f\) we applied to \(X\) is that it is not one-to-one. In other words, \(f\) collapses several states of \(X\) into a single state of \(Y\). This way, the “present” may end up containing so little information that the past suddenly becomes relevant for the dynamics of the future evolution.
In a game of tennis, the scoring system is as follows: both players start with the score of \(0\). Each time player 1 wins a point (a.k.a. a rally), her score moves a step up in the following hierarchy \[0 \mapsto 15 \mapsto 30 \mapsto 40.\] Once she reaches \(40\) and scores a point, three things can happen:
if the score of player 2 is \(30\) or less, player 1 wins the game.
if the score of player 2 is \(40\), the score of player 1 moves up to “advantage”, and
if the score of player 2 is “advantage”, nothing happens to the score of player 1 but the score of player 2 falls back to \(40\).
Finally, if the score of player 1 is “advantage” and she wins a point, she wins the game. The situation is entirely symmetric for player 2. We suppose that the probability that player 1 wins each point is \(p\in (0,1)\), independently of the current score. A situation like this is a typical example of a Markov chain in an applied setting. What are the states of the process? We obviously need to know both players’ scores and we also need to know if one of the players has won the game. Therefore, a possible state space is the following:
\[\begin{align} S= \Big\{ &(0,0), (0,15), (0,30), (0,40), (15,0), (15,15), (15,30), (15,40), (30,0), (30,15),\\ & (30,30), (30,40), (40,0), (40,15), (40,30), (40,40), (40,A), (A,40), P_1, P_2 \Big\} \end{align}\]
where \(A\) stands for “advantage” and \(P_1\) (resp., \(P_2\)) denotes the state where player 1 (resp., player 2) wins. It is not hard to assign probabilities to transitions between states. Once we reach either \(P_1\) or \(P_2\) the game stops. We can assume that the chain remains in that state forever, i.e., the state is absorbing.
The initial distribution is quite simple - we always start from the same state \((0,0)\), so that \({a}^{(0)}_{(0,0)}=1\) and \({a}^{(0)}_i=0\) for all \(i\in S\setminus\{(0,0)\}\).
How about the transition matrix? When the number of states is big (\(\# S=20\) in this case), transition matrices are useful in computer memory, but not so much on paper. Just for the fun of it, here is the transition matrix for our game-of-tennis chain (I am going to leave it up to you to figure out how rows/columns of the matrix match to states) \[ {\mathbf P}= \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & q & 0 & 0 & p & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & q & 0 & 0 & p & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & q & 0 & 0 & p & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & q & 0 & 0 & 0 & 0 & 0 & 0 & 0 & p & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & q & 0 & 0 & p & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & q & 0 & 0 & p & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & q & 0 & 0 & p & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & q & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & p & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & q & 0 & 0 & p & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & q & 0 & 0 & p & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & q & 0 & 0 & p & 0 & 0 & 0 \\ 0 & q & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & p & 0 & 0 \\ p & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & q & 0 & 0 & 0 & 0 \\ p & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & q & 0 & 0 & 0 \\ p & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & q & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & q & p \\ 0 & q & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & p & 0 & 0 \\ p & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & q & 0 & 0 
\\ \end{pmatrix} \]
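Rather than typing such a matrix by hand, one can let the scoring rules generate it. The sketch below is one possible way to do this; the state labels (such as `"40-30"` or `"A-40"`), their ordering and the value `p = 0.6` are our own choices, so the rows and columns will not necessarily be in the same order as in the matrix printed above.

```r
p = 0.6; q = 1 - p   # hypothetical value of p, chosen for illustration only
pts = c("0", "15", "30", "40")
# score states "s1-s2", the two advantage states and the two absorbing states
states = c(as.vector(outer(pts, pts, paste, sep = "-")),
           "A-40", "40-A", "P1", "P2")
P = matrix(0, 20, 20, dimnames = list(states, states))

for (i in 1:4) for (j in 1:4) {
  from = paste(pts[i], pts[j], sep = "-")
  # where we go if player 1 wins the point
  to1 = if (i < 4) paste(pts[i + 1], pts[j], sep = "-") else
    if (j < 4) "P1" else "A-40"
  # where we go if player 2 wins the point
  to2 = if (j < 4) paste(pts[i], pts[j + 1], sep = "-") else
    if (i < 4) "P2" else "40-A"
  P[from, to1] = p
  P[from, to2] = q
}
# advantage states: win the game or fall back to 40-40
P["A-40", c("P1", "40-40")] = c(p, q)
P["40-A", c("40-40", "P2")] = c(p, q)
# absorbing states
P["P1", "P1"] = 1
P["P2", "P2"] = 1

range(rowSums(P))   # every row should sum to 1
```

Using state names as `dimnames` makes individual entries easy to inspect, e.g. `P["40-40", "A-40"]`.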
Does the structure of a game of tennis make it easier or harder for the better player to win? In other words, if you had to play against the best tennis player in the world (I am rudely assuming that he or she is better than you), would you have a better chance of winning if you only played a point (rally), or if you played the whole game? We will give a precise answer to this question in a little while. In the meantime, try to guess.
The transition probabilities \(p_{ij}\), \(i,j\in S\) tell us how a Markov chain jumps from one state to another in a single step. Think of it as a description of the local behavior of the chain. This is the information one can usually obtain from observations and modeling assumptions. On the other hand, it is the global (long-time) behavior of the model that provides the most interesting insights. In that spirit, we turn our attention to probabilities like this: \[{\mathbb{P}}[X_{k+n}=j|X_k=i] \text{ for } n = 1,2,\dots.\] Since we are assuming that all of our chains are homogeneous (transition probabilities do not change with time), this probability does not depend on the time \(k\), so we can define the multi-step transition probabilities \(p^{(n)}_{ij}\) as follows: \[p^{(n)}_{ij}={\mathbb{P}}[X_{k+n}=j|X_{k}=i]={\mathbb{P}}[ X_{n}=j|X_0=i].\] We allow \(n=0\) under the useful convention that \[p^{(0)}_{ij}=\begin{cases} 1, & i=j,\\ 0,& i\not = j. \end{cases}\] The numbers \(p^{(n)}_{ij}\), \(i,j\in S\), naturally fit into an \(N\times N\) matrix (where \(N\) is the number of states), which we denote by \({\mathbf P}^{(n)}\). We note right away that \[\begin{equation} {\mathbf P}^{(0)}= \operatorname{Id}\text{ and } {\mathbf P}^{(1)}= {\mathbf P}, (\#eq:Przo) \end{equation}\] where \(\operatorname{Id}\) denotes the \(N\times N\) identity matrix.
The central result of this section is the following sequence of equalities connecting \({\mathbf P}^{(n)}\) for different values of \(n\), known as the Chapman-Kolmogorov equations: \[\begin{equation} {\mathbf P}^{(m+n)} = {\mathbf P}^{(m)} {\mathbf P}^{(n)}, \text{ for all } m,n \in {\mathbb{N}}_0. (\#eq:CK) \end{equation}\] To see why this is true we start by computing \({\mathbb{P}}[ X_{n+m} = j, X_0=i]\). Since each trajectory from \(i\) to \(j\) in \(n+m\) steps has to be somewhere at time \(n\), we can write \[\begin{equation} {\mathbb{P}}[ X_{n+m}= j, X_0 = i] = \sum_{k\in S} {\mathbb{P}}[X_{n+m} = j, X_{n} = k, X_0 = i]. (\#eq:one-CK) \end{equation}\] By the multiplication rule, we have \[\begin{multline} {\mathbb{P}}[X_{n+m} = j, X_{n} = k, X_0 = i] = {\mathbb{P}}[ X_{n+m} = j | X_{n}=k, X_{0}=i] {\mathbb{P}}[X_{n}=k, X_0 = i], (\#eq:two-CK) \end{multline}\] and then, by the Markov property: \[\begin{equation} {\mathbb{P}}[ X_{n+m} = j | X_{n}=k, X_{0}=i] = {\mathbb{P}}[ X_{n+m} = j | X_n = k]. (\#eq:three-CK) \end{equation}\] Combining @ref(eq:one-CK), @ref(eq:two-CK) and @ref(eq:three-CK) we obtain the following equality: \[\begin{equation} {\mathbb{P}}[ X_{n+m}= j, X_0 = i] = \sum_{k\in S} {\mathbb{P}}[ X_{n+m} = j | X_n = k] {\mathbb{P}}[X_{n}=k, X_0 = i]. \end{equation}\] Dividing both sides by \({\mathbb{P}}[X_0=i]\) and using homogeneity, this reads \(p^{(n+m)}_{ij} = \sum_{k\in S} p^{(n)}_{ik}\, p^{(m)}_{kj}\), which is nothing but @ref(eq:CK) written entry by entry; to see that, just remember how matrices are multiplied.
The punchline is that @ref(eq:CK), together with @ref(eq:Przo) imply that \[\begin{equation} {\mathbf P}^{(n)}= {\mathbf P}^n, (\#eq:Prn-Pn) \end{equation}\] where the left-hand side is the matrix composed of the \(n\)-step transition probabilities, and the right hand side is the \(n\)-th (matrix) power of the (\(1\)-step) transition matrix \({\mathbf P}\). Using @ref(eq:Prn-Pn) allows us to write a simple expression for the distribution of the random variable \(X_n\), for \(n\in{\mathbb{N}}_0\). Remember that the initial distribution (the distribution of \(X_0\)) is denoted by \({a}^{(0)}=({a}^{(0)}_i)_{i\in S}\). Analogously, we define the vector \({a}^{(n)}=({a}^{(n)}_i)_{i\in S}\) by \[{a}^{(n)}_i={\mathbb{P}}[X_n=i],\ i\in S.\] Using the law of total probability, we have \[{a}^{(n)}_i={\mathbb{P}}[X_n=i]=\sum_{k\in S} {\mathbb{P}}[ X_0=k] {\mathbb{P}}[ X_n=i|X_0=k]= \sum_{k\in S} {a}^{(0)}_k p^{(n)}_{ki}.\] We usually interpret \({a}^{(0)}\) as a (row) vector, so the above relationship can be expressed using vector-matrix multiplication \[{a}^{(n)}={a}^{(0)}{\mathbf P}^n.\]
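Both the Chapman-Kolmogorov equations and the formula \({a}^{(n)}={a}^{(0)}{\mathbf P}^n\) are easy to try out numerically. Here is a minimal sketch; the \(3\)-state matrix is made up for illustration, and `mat_pow` is our own helper (base R has no built-in matrix-power operator, so we simply multiply repeatedly).

```r
# n-th matrix power by repeated multiplication (n = 0 gives the identity);
# hypothetical helper, not part of the notes
mat_pow = function(P, n) {
  result = diag(nrow(P))
  for (k in seq_len(n)) result = result %*% P
  result
}

# a made-up 3-state transition matrix (each row sums to 1)
P = matrix(c(0.5, 0.5, 0.0,
             0.2, 0.3, 0.5,
             0.0, 0.4, 0.6),
           byrow = TRUE, ncol = 3)
a0 = c(1, 0, 0)   # start in state 1 with probability 1

# Chapman-Kolmogorov with m = 2, n = 3: the difference should be numerically 0
ck_error = max(abs(mat_pow(P, 5) - mat_pow(P, 2) %*% mat_pow(P, 3)))
ck_error

# distribution of X_4 as a row vector: a^(4) = a^(0) P^4
a4 = a0 %*% mat_pow(P, 4)
a4
sum(a4)   # a probability distribution, so the entries add up to 1
```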
Find an explicit expression for \({\mathbf P}^{(n)}\) in the case of the regime-switching chain introduced above. Feel free to assume that \(p_{ij}>0\) for all \(i,j\).
It is often difficult to compute \({\mathbf P}^n\) for a general transition matrix \({\mathbf P}\) and a large \(n\). We will see later that it will be easier to find the limiting values \(\lim_{n\to\infty}p^{(n)}_{ij}\). The regime-switching chain is one of the few examples where everything can be done by hand.
By @ref(eq:Prn-Pn), we need to compute the \(n\)-th matrix power of the transition matrix \({\mathbf P}\). To make the notation a bit nicer, let us write \(a\) for \(p_{12}\) and \(b\) for \(p_{21}\), so that we can write \[{\mathbf P}= \begin{pmatrix} 1-a & a \\ b & 1-b \end{pmatrix}\]
The winning idea is to use diagonalization, and for that we start by writing down the characteristic equation \(\det (\lambda I-{\mathbf P})=0\) of the matrix \({\mathbf P}\): \[\label{equ:} \nonumber \begin{split} 0&=\det(\lambda I-{\mathbf P})= \begin{vmatrix} \lambda-1+a & -a \\ -b & \lambda-1+b \end{vmatrix}\\ & =((\lambda-1)+a)((\lambda-1)+b)-ab =(\lambda-1)(\lambda-(1-a-b)). \end{split}\] The eigenvalues are, therefore, \(\lambda_1=1\) and \(\lambda_2=1-a-b\), and the corresponding eigenvectors are \(v_1=\binom{1}{1}\) and \(v_2=\binom{a}{-b}\). Therefore, if we define \[V= \begin{pmatrix} 1 & a \\ 1 & -b \end{pmatrix} \text{ and }D= \begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix}= \begin{pmatrix} 1 & 0 \\ 0 & (1-a-b) \end{pmatrix}\] we have \[{\mathbf P}V = V D,\text{ i.e., } {\mathbf P}= V D V^{-1}.\] This representation is very useful for taking (matrix) powers: \[\label{equ:60C4} \begin{split} {\mathbf P}^n &= (V D V^{-1})( V D V^{-1}) \dots (V D V^{-1})= V D^n V^{-1} \\ & = V \begin{pmatrix} 1 & 0 \\ 0 & (1-a-b)^n \end{pmatrix} V^{-1} \end{split}\] We assumed that all \(p_{ij}\) are positive which means, in particular, that \(a+b>0\), so that \[\begin{align} V^{-1} = \tfrac{1}{a+b} \begin{pmatrix} b & a \\ 1 & -1 \end{pmatrix}, \end{align}\]
and so \[\begin{align} {\mathbf P}^n &= V D^n V^{-1}= \begin{pmatrix} 1 & a \\ 1 & -b \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & (1-a-b)^n \end{pmatrix} \tfrac{1}{a+b} \begin{pmatrix} b & a \\ 1 & -1 \end{pmatrix}\\ &= \frac{1}{a+b} \begin{pmatrix} b & a \\ b & a \end{pmatrix} + \frac{(1-a-b)^n}{a+b} \begin{pmatrix} a & -a \\ -b & b \end{pmatrix}\\ &=\begin{pmatrix} \frac{b}{a+b}+(1-a-b)^n \frac{a}{a+b} & \frac{a}{a+b}-(1-a-b)^n \frac{a}{a+b}\\ \frac{b}{a+b}-(1-a-b)^n \frac{b}{a+b} & \frac{a}{a+b}+(1-a-b)^n \frac{b}{a+b} \end{pmatrix} \end{align}\]
The expression for \({\mathbf P}^n\) above tells us a lot about the structure of the multi-step probabilities \(p^{(n)}_{ij}\) for large \(n\). Note that the second matrix on the right-hand side above comes multiplied by \((1-a-b)^n\) which tends to \(0\) as \(n\to\infty\) (under our assumptions that \(p_{ij}>0\).) We can, therefore, write \[{\mathbf P}^n\sim \frac{1}{a+b} \begin{pmatrix} b & a \\ b & a \end{pmatrix} \text{ for large } n.\] The fact that the rows of the right-hand side above are equal points to the fact that, for large \(n\), \(p^{(n)}_{ij}\) does not depend (much) on the initial state \(i\). In other words, this Markov chain forgets its initial condition after a long period of time. This is a rule more than an exception, and we will study such phenomena in the following lectures.
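A quick numerical sanity check of the closed-form expression obtained by multiplying out \(V D^n V^{-1}\) is sketched below. The values `a = 0.3` and `b = 0.2` are arbitrary choices, and the function names are ours.

```r
a = 0.3; b = 0.2   # arbitrary illustrative values of p_12 and p_21
P = matrix(c(1 - a, a,
             b, 1 - b), byrow = TRUE, ncol = 2)

# closed form: P^n = (1/(a+b)) [b a; b a] + ((1-a-b)^n/(a+b)) [a -a; -b b]
P_closed = function(n) {
  (matrix(c(b, a, b, a), byrow = TRUE, ncol = 2) +
     (1 - a - b)^n * matrix(c(a, -a, -b, b), byrow = TRUE, ncol = 2)) / (a + b)
}

# direct computation by repeated multiplication
P_direct = function(n) {
  result = diag(2)
  for (k in seq_len(n)) result = result %*% P
  result
}

max(abs(P_closed(10) - P_direct(10)))   # should be numerically zero
P_direct(50)                            # both rows are close to (b, a)/(a+b)
```

A useful habit: check a formula like this at \(n=0\) and \(n=1\), where the answer (the identity matrix and \({\mathbf P}\) itself) is known in advance.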
One of the (many) reasons Markov chains are a popular modeling tool is the ease with which they can be simulated. When we simulated a random walk, we started at \(0\) and built the process by adding independent coin-toss-distributed increments. We obtained the value of the next position of the walk by adding the present position and the value of an independent random variable. For a general Markov chain, this procedure works almost verbatim, except that the function that combines the present position and a value of an independent random variable may be something other than addition. In general, we collapse the two parts of the process - a simulation of an independent random variable and its combination with the present position - into one. Given our position, we pick the row of the transition matrix that corresponds to it and then use its elements as the probabilities that govern our position tomorrow. It will all be clear once you read through the solution of the following problem.
Simulate \(1000\) trajectories of a gambler’s ruin Markov chain with \(a=3\), \(p=2/3\) and \(x=1\) (see subsection @ref(gambler) above for the meaning of these constants). Use the Monte Carlo method to estimate the probability that the gambler will leave the casino with \(\$3\) in her pocket in at most \(T=100\) time periods.
# state space
S = c(0, 1, 2, 3)
# transition matrix
P = matrix(c(1, 0, 0, 0,
             1/3, 0, 2/3, 0,
             0, 1/3, 0, 2/3,
             0, 0, 0, 1),
           byrow = TRUE, ncol = 4)
T = 100 # number of time periods (note: this masks R's shorthand T for TRUE)
nsim = 1000 # number of simulations
# simulate the next position of the chain
draw_next = function(s) {
  i = match(s, S) # the row number of the state s
  sample(S, prob = P[i, ], size = 1)
}
# simulate a single trajectory of length T
# from the initial state
single_trajectory = function(initial_state) {
  path = numeric(T)
  last = initial_state
  for (n in 1:T) {
    path[n] = draw_next(last)
    last = path[n]
  }
  return(path)
}
# simulate the entire chain: one row per trajectory, columns X0, X1, ..., X100
simulate_chain = function(initial_state) {
  data.frame(X0 = initial_state,
             t(replicate(
               nsim, single_trajectory(initial_state)
             )))
}
df = simulate_chain(1)
(p = mean(df$X100 == 3))
## [1] 0.59
R. The function draw_next is at the heart of the simulation. Given the current state s, it looks up the row of the transition matrix P which corresponds to s. This is where the function match comes in handy - match(s,S) gives you the position of the element s in the vector S. Of course, if \(S = \{ 1,2,3, \dots, n\}\) then we don’t need to use match, as each state is “its own position”. In our case, S is a bit different, namely \(S=\{0,1,2,3\}\), and so match(s,S) is nothing but s+1. This is clearly an overkill in this case, but we still do it for didactical purposes.
Once the row corresponding to the state s is identified, we use its elements as the probabilities to be fed into the command sample, which, in turn, returns our next state and we repeat the procedure over and over (in this case \(T=100\) times).
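For a chain this small, the Monte Carlo estimate can be compared with the exact value \({\mathbb{P}}[X_{100}=3]\), read off from \({a}^{(100)}={a}^{(0)}{\mathbf P}^{100}\). The sketch below recomputes the transition matrix so that it can be run on its own. The comparison value \(4/7\) is not taken from this section; it is the probability of ever reaching \(3\) before \(0\) from \(x=1\), obtained by solving the two linear equations \(h_1=\tfrac{2}{3}h_2\), \(h_2=\tfrac{2}{3}+\tfrac{1}{3}h_1\).

```r
# transition matrix of the gambler's ruin chain on S = {0, 1, 2, 3}, p = 2/3
P = matrix(c(1,   0,   0,   0,
             1/3, 0,   2/3, 0,
             0,   1/3, 0,   2/3,
             0,   0,   0,   1),
           byrow = TRUE, ncol = 4)
a0 = c(0, 1, 0, 0)   # initial distribution: start at x = 1 (second state)

# a^(100) = a^(0) P^100, computed by 100 vector-matrix multiplications
a_n = a0
for (k in 1:100) a_n = a_n %*% P

exact = a_n[4]   # exact value of P[X_100 = 3]; state 3 is the 4th entry
exact
4 / 7            # probability of ever reaching 3 (our own side computation)
```

Since the gambler is almost certain to have hit \(0\) or \(3\) well before time \(100\), the two numbers agree to many decimal places, and a simulation-based estimate with \(1000\) trajectories should land within a few percentage points of them.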
Let \(\{Y_n\}_{n\in {\mathbb{N}}_0}\) be a sequence of die-rolls, i.e., a sequence of independent random variables which take values \(1,2,\dots, 6\), each with probability \(1/6\). Let \(\{X_n\}_{n\in {\mathbb{N}}_0}\) be a stochastic process defined by \[X_n=\max (Y_0,Y_1, \dots, Y_n), \ n\in{\mathbb{N}}_0.\] In words, \(X_n\) is the maximal value rolled so far. Is \(\{X_n\}_{n\in {\mathbb{N}}_0}\) a Markov chain? If it is, find its transition matrix and the initial distribution. If it is not, give an example of how the Markov property is violated.
It turns out that \(\{X_n\}_{n\in {\mathbb{N}}_0}\) is, indeed, a Markov chain. The value of \(X_{n+1}\) is either going to be equal to \(X_n\) if \(Y_{n+1}\) happens to be less than or equal to it, or it moves up to \(Y_{n+1}\), otherwise, i.e., \(X_{n+1}=\max(X_n,Y_{n+1})\). Therefore, the distribution of \(X_{n+1}\) depends on the previous values \(X_0,X_1,\dots, X_n\) only through \(X_n\), and, so, \(\{X_n\}_{n\in {\mathbb{N}}_0}\) is a Markov chain on the state space \(S=\{1,2,3,4,5,6\}\). Since \(X_0=Y_0\), the initial distribution is uniform on \(S\), i.e., \({a}^{(0)}=(1/6,1/6,1/6,1/6,1/6,1/6)\). The transition matrix is given by \[P=\begin{pmatrix} 1/6 & 1/6 & 1/6 & 1/6 & 1/6 & 1/6 \\ 0 & 1/3 & 1/6 & 1/6 & 1/6 & 1/6 \\ 0 & 0 & 1/2 & 1/6 & 1/6 & 1/6 \\ 0 & 0 & 0 & 2/3 & 1/6 & 1/6 \\ 0 & 0 & 0 & 0 & 5/6 & 1/6 \\ 0 & 0 & 0 & 0 & 0 & 1 \\ \end{pmatrix}\]
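The transition matrix above follows a simple pattern (row \(i\) has \(i/6\) on the diagonal and \(1/6\) to the right of it), so it can be generated in a loop. As a sanity check, the sketch below compares \({a}^{(0)}P^n\), where \({a}^{(0)}\) is uniform because \(X_0=Y_0\), with the direct formula \({\mathbb{P}}[X_n\leq k]=(k/6)^{n+1}\), which holds because all \(n+1\) rolls must be at most \(k\). The choice \(n=3\) is arbitrary.

```r
# transition matrix of the running maximum of die rolls
P = matrix(0, nrow = 6, ncol = 6)
for (i in 1:6) {
  P[i, i] = i / 6                      # the new roll does not exceed the maximum
  if (i < 6) P[i, (i + 1):6] = 1 / 6   # the new roll becomes the new maximum
}
a0 = rep(1 / 6, 6)   # X_0 = Y_0 is uniform on {1, ..., 6}

# distribution of X_n via a^(n) = a^(0) P^n
n = 3
a_n = a0
for (k in seq_len(n)) a_n = a_n %*% P

# direct computation from P[X_n <= k] = (k/6)^(n+1)
direct = diff(c(0, ((1:6) / 6)^(n + 1)))
rbind(chain = as.vector(a_n), direct = direct)
```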
Let \(\{X_n\}_{n\in {\mathbb{N}}_0}\) be a simple symmetric random walk. For \(n\in{\mathbb{N}}_0\), define \(Y_n = 2X_n+1\), and let \(Z_n\) be the amount of time \(X_n\) spent strictly above \(0\) up to (and including) time \(n\), i.e. \[Z_0=0, Z_{n+1} - Z_n = \begin{cases} 1, & X_{n+1}>0 \\ 0, & X_ {n+1}\leq 0 \end{cases} , \text{ for }n\in{\mathbb{N}}_0.\] Is \(Y\) a Markov chain? Is \(Z\)?
\(Y\) is a Markov chain because it is just a random walk started at \(1\) with steps of size \(2\) (a more rigorous proof would follow the same line of reasoning as the proof that random walks are Markov chains). \(Z\) is not a Markov chain because the knowledge of far history (beyond the present position) affects the likelihood of the next transition as the following example shows: \[\begin{aligned} {\mathbb{P}}[ Z_4=2| Z_0=0, Z_1=0, Z_2=0, Z_3=1]=1/2\end{aligned}\] but \[\begin{aligned} {\mathbb{P}}[ Z_4=2| Z_0=0, Z_1=1, Z_2=1, Z_3=1]= 0.\end{aligned}\]
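The two conditional probabilities in this counterexample only involve the first four steps of the walk, so they can be verified by listing all \(2^4=16\) equally likely paths. This is a sketch along those lines; `cond_prob` is our own helper.

```r
# all 2^4 equally likely step sequences of the walk up to time 4
steps = as.matrix(expand.grid(rep(list(c(-1, 1)), 4)))
X = t(apply(steps, 1, cumsum))    # columns: X_1, ..., X_4 (X_0 = 0)
Z = t(apply(X > 0, 1, cumsum))    # columns: Z_1, ..., Z_4 (Z_0 = 0)

# conditional probability by counting equally likely paths
# (hypothetical helper, introduced only for this illustration)
cond_prob = function(event, given) sum(event & given) / sum(given)

# same present (Z_3 = 1), two different histories
p_hist_a = cond_prob(Z[, 4] == 2, Z[, 1] == 0 & Z[, 2] == 0 & Z[, 3] == 1)
p_hist_b = cond_prob(Z[, 4] == 2, Z[, 1] == 1 & Z[, 2] == 1 & Z[, 3] == 1)
c(p_hist_a, p_hist_b)
```

The intuition: in the first history the walk is at \(1\) at time \(3\), while in the second it is at \(-1\), and \(Z_3\) alone cannot tell these apart.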
Let \(\{\delta_n\}_{n\in{\mathbb{N}}}\) be a sequence of independent coin tosses (i.e., random variables with values \(T\) or \(H\) with equal probabilities). Let \(X_0=0\), and, for \(n\in{\mathbb{N}}\), let \(X_n\) be the number of times two consecutive \(\delta\)s take the same value in the first \(n+1\) tosses. For example, if the outcome of the coin tosses is TTHHTTTH …, we have \(X_0=0\), \(X_1=1\), \(X_2=1\), \(X_3=2\), \(X_4=2\), \(X_5=3\), \(X_6=4\), \(X_7=4\), …
Is \(\{X_n\}_{n\in {\mathbb{N}}_0}\) a Markov chain? If it is, describe its state space, the transition probabilities and the initial distribution. If it is not, show exactly how the Markov property is violated.
Yes, the process \(X\) is a Markov chain, on the state space \(S={\mathbb{N}}_0\). To show that we make the following simple observation: we have \(X_{n}-X_{n-1}=1\) if \(\delta_n=\delta_{n+1}\) and \(X_n-X_{n-1}=0\), otherwise (for \(n\in{\mathbb{N}}\)). Therefore, \[{\mathbb{P}}[ X_{n+1}=i_n+1 | X_{n}=i_n, \dots, X_1=i_1, X_0=0] = {\mathbb{P}}[ \delta_{n+2}=\delta_{n+1} | X_{n}=i_n, \dots, X_1=i_1,X_0=0].\] Even if we knew the exact values of all \(\delta_1,\dots, \delta_n,\delta_{n+1}\), the (conditional) probability that \(\delta_{n+2}=\delta_{n+1}\) would still be \(1/2\), regardless of these values. Therefore, \[{\mathbb{P}}[ X_{n+1}=i_n+1| X_n=i_n,\dots, X_1=i_1, X_0=0] = \tfrac{1}{2},\] and, similarly, \[{\mathbb{P}}[ X_{n+1}=i_n| X_n=i_n,\dots, X_1=i_1, X_0=0] = \tfrac{1}{2}.\] Therefore, the conditional probability given all the past depends on the past only through the value of \(X_n\) (the current position), and we conclude that \(X\) is, indeed, a Markov process. Its initial distribution is deterministic \({\mathbb{P}}[X_0=0]=1\), and the transition probabilities, as computed above, are \[p_{ij}={\mathbb{P}}[ X_{n+1}=j| X_n=i] = \begin{cases} 1/2, &\text{ if } j=i+1, \\ 1/2, &\text{ if } j=i, \\ 0, &\text{ otherwise.} \end{cases}\] In fact, \(2 X_n - n\) is a simple symmetric random walk.
Let \(X\) be a Markov chain on \(N\) states, with the \(N\times N\) transition matrix \(P\). We construct a new Markov chain \(Y\) from the transition mechanism of \(X\) as follows: at each point in time, we toss a biased coin (probability of heads \(p\in (0,1)\)), independently of everything else. If it shows heads we move according to the transition matrix of \(X\). If it shows tails, we remain in the same state. What is the transition matrix of \(Y\)?
Let \(Q=(q_{ij})\) denote the transition matrix of the chain \(Y\). When \(i\ne j\), the chain \(Y\) will go from \(i\) to \(j\) in one step if and only if the coin shows heads and the chain \(X\) wants to jump from \(i\) to \(j\). The two events are independent, the probability of the former is \(p\) and that of the latter is \(p_{ij}\), so \(q_{ij} = p\, p_{ij}\).
In the case \(i=j\), the chain \(Y\) will transition from \(i\) to \(i\) (i.e., stay in \(i\)) if either the coin shows tails, or if the coin shows heads and the chain \(X\) decides to stay in \(i\). Therefore, \(q_{ii} = (1-p) + p\, p_{ii}\), i.e., \[ Q = (1-p) \operatorname{Id}+p P,\] where \(\operatorname{Id}\) denotes the \(N\times N\) identity matrix.
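In code, the construction from the problem statement (move according to \(P\) on heads, which has probability \(p\), and stay put on tails) is a one-liner. The function name `lazy_chain` and the \(2\)-state example are made up for illustration.

```r
# transition matrix of the chain Y: with probability p move according to P,
# with probability 1 - p stay in the current state
lazy_chain = function(P, p) {
  (1 - p) * diag(nrow(P)) + p * P
}

# a made-up 2-state example
P = matrix(c(0.0, 1.0,
             0.5, 0.5), byrow = TRUE, ncol = 2)
Q = lazy_chain(P, p = 0.3)
Q
rowSums(Q)   # Q is again a transition matrix: rows sum to 1
```

Note that off-diagonal entries of `Q` are those of `P` scaled by \(p\), while the diagonal picks up the extra mass \(1-p\).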
The red container has 100 red balls, and the blue container has 100 blue balls. In each step
- a container is selected (with equal probabilities),
- a ball is selected from it (all balls in the container are equally likely to be selected), and
- the selected ball is placed in the other container. If the selected container is empty, no ball is transferred.
Once there are 100 blue balls in the red container and 100 red balls in the blue container, the game stops.
We decide to model the situation as a Markov chain.
What is the state space \(S\) we can use? How large is it?
What is the initial distribution?
What are the transition probabilities between states? Don’t write the matrix, it is way too large; just write a general expression for \(p_{ij}\), \(i,j\in S\).
(Note: this is a version of the famous Ehrenfest Chain from statistical physics.)
There are many ways in which one can solve this problem. Below is just one of them.
In order to describe the situation being modeled, we need to keep track of the number of balls of each color in each container. Therefore, one possibility is to take the set of all quadruplets \((r,b,R,B)\), \(r,b,R,B\in \{0,1,2,\dots, 100\}\) and this state space would have \(101^4\) elements. We know, however, that the total number of red balls, and the total number of blue balls is always equal to 100, so the knowledge of the composition of the red (say) container is enough to reconstruct the contents of the blue container. In other words, we can use the number of balls of each color in the red container only as our state, i.e. \[S= \{ (r,b)\, : \, r,b=0,1,\dots, 100\}.\] This state space has \(101\times 101=10201\) elements.
The initial distribution is deterministic: \({\mathbb{P}}[X_0=(100,0)]=1\) and \({\mathbb{P}}[X_0=i]=0\), for \(i\in S\setminus\{(100,0)\}\). In the vector notation, \[{a}^{(0)}=(0,0, \dots, 0, 1, 0, \dots, 0),\] where \(1\) is at the place corresponding to \((100,0)\).
Let us consider several separate cases, with the understanding that \(p_{ij}=0\), for all \(i,j\) not mentioned explicitly below:
One of the containers is empty. In that case, we are either in \((0,0)\) or in \((100,100)\). Let us describe the situation for \((0,0)\) first. If we choose the red container - and that happens with probability \(\tfrac{1}{2}\) - we stay in \((0,0)\): \[p_{(0,0),(0,0)}=\tfrac{1}{2}.\] If the blue container is chosen, a ball of either color will be chosen with probability \(\tfrac{100}{200}=\tfrac{1}{2}\), so \[p_{(0,0),(1,0)}=p_{(0,0),(0,1)}=\tfrac{1}{4}.\] By the same reasoning (now it is the blue container that is empty), \[p_{(100,100),(100,100)}=\tfrac{1}{2}\text{ and } p_{(100,100),(99,100)}=p_{(100,100),(100,99)}=\tfrac{1}{4}.\]
We are in the state \((0,100)\). By the description of the model, this is an absorbing state, so \(p_{(0,100),(0,100)}=1.\)
All other cases. Suppose we are in the state \((r,b)\) where \((r,b)\not\in\{(0,100),(0,0),(100,100)\}\). If the red container is chosen, then the probability of getting a red ball is \(\tfrac{r}{r+b}\), so \[p_{(r,b),(r-1,b)}= \tfrac{1}{2}\tfrac{r}{r+b}.\] Similarly, \[p_{(r,b),(r,b-1)}= \tfrac{1}{2}\tfrac{b}{r+b}.\] In the blue container there are \(100-r\) red and \(100-b\) blue balls. Thus, \[p_{(r,b),(r+1,b)}= \tfrac{1}{2}\tfrac{100-r}{200-r-b},\] and \[p_{(r,b),(r,b+1)}= \tfrac{1}{2}\tfrac{100-b}{200-r-b}.\]
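The case analysis above translates directly into a function that lists the one-step transitions out of a given state. This is only a sketch under the conventions of this solution (the state is the pair of red and blue counts in the red container); the function name `transitions_from` and the data-frame output format are our own choices.

```r
# one-step transitions out of the state (r, b), where r and b are the numbers
# of red and blue balls in the red container; returns a data frame with one
# row per reachable state and its probability
transitions_from = function(r, b) {
  if (r == 0 && b == 100) {               # absorbing state: the game is over
    return(data.frame(r = 0, b = 100, prob = 1))
  }
  n_red = r + b                           # balls in the red container
  n_blue = 200 - n_red                    # balls in the blue container
  out = data.frame(r = r, b = b, prob = 0)   # row 1: the state does not change
  if (n_red == 0) {
    out$prob[1] = out$prob[1] + 1 / 2     # red container chosen, but it is empty
  } else {
    out = rbind(out,
                data.frame(r = r - 1, b = b, prob = r / (2 * n_red)),
                data.frame(r = r, b = b - 1, prob = b / (2 * n_red)))
  }
  if (n_blue == 0) {
    out$prob[1] = out$prob[1] + 1 / 2     # blue container chosen, but it is empty
  } else {
    out = rbind(out,
                data.frame(r = r + 1, b = b, prob = (100 - r) / (2 * n_blue)),
                data.frame(r = r, b = b + 1, prob = (100 - b) / (2 * n_blue)))
  }
  out[out$prob > 0, ]                     # drop impossible moves
}

transitions_from(100, 0)   # the initial state
transitions_from(0, 0)     # red container empty
```

Checking that the probabilities in each output add up to \(1\) is a cheap way to catch a missed case.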
A “deck” of cards starts with 2 red and 2 black cards. A “move” consists of the following:
- pick a random card from the deck (if the deck is empty, do nothing),
- if the card is black and the card drawn on the previous move was also black, return it back to the deck,
- otherwise, throw the card away (this, in particular, applies to any card drawn on the first move, since there is no “previous” move at that time).
Model the situation using a Markov chain: find an appropriate state space, and sketch the transition graph with transition probabilities. How small can you make the state space?
What is the probability that the deck will be empty after exactly \(4\) moves? What is the probability that the deck will be empty eventually?
We need to keep track of the number of remaining cards of each color in the deck, as well as the color of the last card we picked (except at the beginning or when the deck is empty, when it does not matter). Therefore, the initial state will be \((2,2)\), the empty-deck state will be \((0,0)\) and the other states will be triplets of the form \((\#r, \#b, c)\), where \(\#r\) and \(\#b\) denote the number of cards (red and black) in the deck, and \(c\) is the color, \(R\) or \(B\), of the last card we picked. This way, the initial guess for the state space would be \[\begin{aligned} S_0 = \{&(2,2), (0,0),\\ & (2,1,B), (2,1,R), (1,2,B), (1,2,R),\\ & (1,1,B), (1,1,R), (0,2,B), (2,0,R), (2,0,B), (0,2,R),\\ & (0,1,B), (0,1,R), (1,0,B), (1,0,R) \} \end{aligned}\]
In order to decrease the size of the state space, we start the chain at \((2,2)\) and consider all trajectories it is possible to take from there. It turns out that states \((2,1,R), (1,2,B), (0,2,B), (2,0,R), (2,0,B)\) and \((1,0,R)\) can never be reached from \((2,2)\), so we might as well leave them out of the state space. That reduces the initial guess \(S_0\) to a smaller, \(10\)-state version \[\begin{equation} S = \{(2,2), (0,0), (2,1,B), (1,2,R), (1,1,B), (1,1,R), (0,2,R), (0,1,B), (0,1,R), (1,0,B) \} \end{equation}\] with the following transition graph:
You could further reduce the number of states to \(9\) by removing the initial state \((2,2)\) and choosing a non-deterministic initial distribution over the states that can be reached from it in one step. There is something unsatisfying about that, though.
To get from \((2,2)\) to \((0,0)\) in exactly four steps, we need to follow one of the following three paths: \[\begin{aligned} & (2,2) \to (2,1,B) \to (1,1,R) \to (1,0,B) \to (0,0), \\ & (2,2) \to (2,1,B) \to (1,1,R) \to (0,1,R) \to (0,0), \text{ or }\\ & (2,2) \to (1,2,R) \to (1,1,B) \to (0,1,R) \to (0,0). \\ \end{aligned}\] Their respective probabilities happen to be the same, namely \(\tfrac{1}{2}\times \tfrac{2}{3} \times \tfrac{1}{2}\times 1 = \frac{1}{6}\), so the probability of hitting \((0,0)\) in exactly \(4\) steps is \(3 \times \frac{1}{6} = \tfrac{1}{2}\).
To compute the probability of hitting \((0,0)\) eventually, we note that this is guaranteed to happen sooner or later (see the graph above) if the first card we draw is black. It is also guaranteed to happen if the first card we draw is red, but the second one is black. In fact, the only way for this not to happen is to draw two red cards on the first two draws. This happens with probability \(\tfrac{1}{2}\times \frac{1}{3} = \frac{1}{6}\), so the required probability of ending up with an empty deck is \(1 - \frac{1}{6} = \frac{5}{6}\).
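Both answers can be confirmed numerically by encoding the \(10\)-state chain and taking matrix powers. In the sketch below, the string labels (e.g. `"21B"` for \((2,1,B)\)) are our own shorthand, and the transition probabilities are read off from the rules of the game. Since the deck cannot be empty before move \(4\), the entry of \(P^4\) from \((2,2)\) to \((0,0)\) is exactly the probability of emptying the deck in exactly \(4\) moves; the "eventual" probability is approximated by running the chain for a large number of steps (\(200\) is an arbitrary choice).

```r
# states: (#red, #black, color of last card drawn), written as short strings
states = c("22", "21B", "12R", "11B", "11R", "02R", "01B", "01R", "10B", "00")
P = matrix(0, 10, 10, dimnames = list(states, states))
P["22",  c("21B", "12R")] = 1 / 2
P["21B", c("21B", "11R")] = c(1 / 3, 2 / 3)   # a black card is returned
P["12R", c("02R", "11B")] = c(1 / 3, 2 / 3)
P["11B", c("11B", "01R")] = 1 / 2             # a black card is returned
P["11R", c("10B", "01R")] = 1 / 2
P["02R", "01B"] = 1
P["01B", "01B"] = 1                           # stuck: the last black card keeps coming back
P["01R", "00"] = 1
P["10B", "00"] = 1
P["00",  "00"] = 1

P4 = P %*% P %*% P %*% P
P4["22", "00"]        # probability that the deck is empty after exactly 4 moves

# distribution after many moves, starting from (2,2)
a_n = matrix(as.numeric(states == "22"), nrow = 1)
for (k in 1:200) a_n = a_n %*% P
a_n[1, "00"]          # approximates the probability of eventually emptying the deck
```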
A country has \(m+1\) cities (\(m\in{\mathbb{N}}\)), one of which is the capital. There is a direct railway connection between each city and the capital, but there are no tracks between any two “non-capital” cities. A traveler starts in the capital and takes a train to a randomly chosen non-capital city (all cities are equally likely to be chosen), spends a night there and returns the next morning and immediately boards the train to the next city according to the same rule, spends the night there, …, etc. We assume that her choice of the city is independent of the cities visited in the past. Let \(\{X_n\}_{n\in {\mathbb{N}}_0}\) be the number of visited non-capital cities up to (and including) day \(n\), so that \(X_0=1\), but \(X_1\) could be either \(1\) or \(2\), etc.
Explain why \(\{X_n\}_{n\in {\mathbb{N}}_0}\) is a Markov chain on the appropriate state space \(S\) and find the transition probabilities of \(\{X_n\}_{n\in {\mathbb{N}}_0}\), i.e., write an expression for \[{\mathbb{P}}[X_{n+1}=j|X_n=i], \text{ for $i,j\in S$.}\]
Let \(\tau_m\) be the first time the traveler has visited all \(m\) non-capital cities, i.e. \[\tau_m=\min \{ n\in{\mathbb{N}}_0\, : \, X_n=m\}.\] What is the distribution of \(\tau_m\), for \(m=1\) and \(m=2\)?
Compute \({\mathbb{E}}[\tau_m]\) for general \(m\in{\mathbb{N}}\). What is the asymptotic behavior of \({\mathbb{E}}[\tau_m]\) as \(m\to\infty\)? More precisely, find a simple function \(f(m)\) of \(m\) (like \(m^2\) or \(\log(m)\)) such that \({\mathbb{E}}[\tau_m] \sim f(m)\), i.e., \(\lim_{m\to\infty} \frac{{\mathbb{E}}[\tau_m]}{f(m)} = 1\).
The natural state space for \(\{X_n\}_{n\in {\mathbb{N}}_0}\) is \(S=\{1,2,\dots, m\}\). It is clear that \({\mathbb{P}}[X_{n+1}=j|X_n=i]=0,\) unless \(j=i\) or \(j=i+1\). If we start from the state \(i\), the process will remain in \(i\) if the traveler visits one of the already-visited cities, and move to \(i+1\) if the visited city has never been visited before. Thanks to the uniform distribution in the choice of the next city, the probability that a never-visited city will be selected is \(\tfrac{m-i}{m}\), and it does not depend on the (names of the) cities already visited, or on the times of their first visits; it only depends on their number. Consequently, the extra information about \(X_1,X_2,\dots, X_{n-1}\) will not change the probability of visiting \(j\) in any way, which is exactly what the Markov property is all about. Therefore, \(\{X_n\}_{n\in {\mathbb{N}}_0}\) is Markov and its transition probabilities are given by \[p_{ij}={\mathbb{P}}[X_{n+1}=j|X_{n}=i]= \begin{cases} 0, & j\not \in \{i,i+1\}\\ \tfrac{m-i}{m}, & j=i+1\\ \tfrac{i}{m}, & j=i. \end{cases}\] (Note: the situation would not be nearly as nice if the distribution of the choice of the next city were non-uniform. In that case, the list of the (names of the) already-visited cities would matter, and it is not clear that the described process has the Markov property (does it?). )
For \(m=1\), \(\tau_m=0\), so its distribution is deterministic and concentrated on \(0\). The case \(m=2\) is only slightly more complicated. After having visited her first city, the traveler has a probability of \(\tfrac{1}{2}\) of visiting it again, on each consecutive day. After a geometrically distributed number of days, she will visit the other city and \(\tau_2\) will be realized. Therefore the distribution \(\{p_n\}_{n\in {\mathbb{N}}_0}\) of \(\tau_2\) is given by \[p_0=0, p_1=\tfrac{1}{2}, p_2=(\tfrac{1}{2})^2, p_3=(\tfrac{1}{2})^3,\dots\]
For \(m>1\), we can write \(\tau_m\) as \[\tau_m=\tau_1+(\tau_2-\tau_1)+\dots +(\tau_m-\tau_{m-1}),\] so that \[{\mathbb{E}}[\tau_m]={\mathbb{E}}[\tau_1]+{\mathbb{E}}[\tau_2-\tau_1]+\dots+{\mathbb{E}}[\tau_m-\tau_{m-1}].\] We know that \(\tau_1=0\) and for \(k=1,2,\dots, m-1\), the difference \(\tau_{k+1}-\tau_{k}\) denotes the waiting time before a never-before-visited city is visited, given that the number of already-visited cities is \(k\). This random variable is geometric with success probability given by \(\tfrac{m-k}{m}\), so its expectation is given by \[{\mathbb{E}}[\tau_{k+1}-\tau_k]= \frac{1}{ \tfrac{m-k}{m}}=\frac{m}{m-k}.\] Therefore, \[{\mathbb{E}}[\tau_m]=\sum_{k=1}^{m-1} \frac{m}{m-k}= m (1+\tfrac{1}{2}+\tfrac{1}{3}+\dots+\tfrac{1}{m-1}).\] By comparing it with the integral \(\int_1^m \frac{1}{x}\, dx\), it is possible to conclude that \(H_m=1+\tfrac{1}{2}+\dots+\tfrac{1}{m-1}\) behaves like \(\log m\), i.e., that \[\lim_{m\to\infty} \frac{H_m}{\log m} = 1.\] Therefore \({\mathbb{E}}[\tau_m] \sim f(m)\), where \(f(m) = m \log m\).
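To get a feeling for how quickly \({\mathbb{E}}[\tau_m]\) approaches \(m\log m\), one can simply tabulate the exact formula. The sketch below does that; the function name `expected_tau` and the particular values of \(m\) are our own choices.

```r
# exact expectation E[tau_m] = m * (1 + 1/2 + ... + 1/(m-1)), for m >= 2
expected_tau = function(m) m * sum(1 / (1:(m - 1)))

m = c(10, 100, 1000, 10000)
exact = sapply(m, expected_tau)
ratio = exact / (m * log(m))
data.frame(m = m, exact = exact, ratio = ratio)
```

The ratio does tend to \(1\), but slowly: the difference between the harmonic sum and \(\log m\) stays bounded, so the relative error decays only like \(1/\log m\).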
We start with two glasses, call them \(A\) and \(B\). Glass \(A\) contains \(12\) oz of milk, and glass \(B\) \(12\) oz of water. The following procedure is then performed twice: first, half of the content of glass \(A\) is transferred into glass \(B\). Then, the contents of glass \(B\) are thoroughly mixed, and a third of its entire content is transferred back to \(A\). Finally, the contents of glass \(A\) are thoroughly mixed. What is the final amount of milk in glass \(A\)? What does this have to do with Markov chains?
If there are \(a\) oz of milk and \(b\) oz of water in glass \(A\) at time \(n\) (with \(a+b=12\)), then there are \(b\) oz of milk and \(a\) oz of water in glass \(B\). After half of the content of glass \(A\) is moved to \(B\), glass \(B\) will contain \(b+\tfrac{1}{2}a\) oz of milk and \(a+\tfrac{1}{2}b\) oz of water. Transferring a third of that back to \(A\) leaves \(B\) with \((2/3 b + 1/3 a)\) oz of milk and \((2/3 a + 1/3 b)\) oz of water. Equivalently, \(A\) contains \((2/3 a + 1/3 b)\) oz of milk and \((1/3 a + 2/3 b)\) oz of water. This corresponds to the action of a Markov chain with the transition matrix \(P = \begin{bmatrix} 2/3 & 1/3 \\ 1/3 & 2/3 \end{bmatrix}\). We get the required amounts by computing \[\begin{aligned} (12,0) P^2 = (12,0) \begin{bmatrix} 5/9 & 4/9 \\ 4/9 & 5/9\end{bmatrix} = (20/3, 16/3).\end{aligned}\]
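The same arithmetic can be checked in R. This is only a small sketch; the variable names `milk0` and `milk2` are ours, and the row vector is read as (milk in \(A\), milk in \(B\)).

```r
# Transition matrix describing one round of pouring (rows/columns: glass A, glass B)
P = matrix(c(2/3, 1/3,
             1/3, 2/3), byrow = TRUE, ncol = 2)
milk0 = matrix(c(12, 0), nrow = 1)   # all 12 oz of milk start in glass A
milk2 = milk0 %*% P %*% P            # where the milk is after two rounds
milk2
```

The first entry of `milk2` is the amount of milk in glass \(A\) after the procedure has been performed twice.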
The state space of a Markov chain is \(S = \{1,2,3,4,5\}\), and the non-zero transition probabilities are given by \(p_{11} = 1/2\), \(p_{12}=1/2\), \(p_{23}=p_{34}=p_{45}=p_{51}=1\). Compute \(p^{(6)}_{12}\) without using software.
As you can see from the transition graph of the chain, you can go from \(1\) to \(2\) in \(6\) steps in exactly two ways: \[1 \to 2 \to 3 \to 4 \to 5 \to 1 \to 2\] and \[1 \to 1 \to 1 \to 1 \to 1 \to 1 \to 2.\] The probability of the first path is \(2^{-2}\) and the probability of the second path is \(2^{-6}\); they add up to \(\tfrac{17}{64}\).
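Although the problem asks for a computation by hand, it is easy to cross-check the answer with software afterwards. The sketch below builds the transition matrix from the listed probabilities and multiplies it out six times; the variable name `P6` is ours.

```r
# Transition matrix of the five-state chain; unlisted entries are 0
P = matrix(0, nrow = 5, ncol = 5)
P[1, 1] = 1/2; P[1, 2] = 1/2
P[2, 3] = 1; P[3, 4] = 1; P[4, 5] = 1; P[5, 1] = 1

# Six-step transition matrix by repeated multiplication
P6 = diag(5)
for (k in 1:6) P6 = P6 %*% P
P6[1, 2]   # six-step probability of going from 1 to 2
```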
A Markov chain has four states \(1\), \(2\), \(3\) and \(4\) and the following transition probabilities (the ones not listed are \(0\)) \[\begin{align} p_{11} = 1/3, \, p_{12} = 1/3, \, p_{13} = 1/3, \, p_{22} = 1, \\ p_{33} = 1/2, \, p_{34} = 1/2, \, p_{44}=1. \end{align}\]
Sketch the transition graph of this chain.
Compute the conditional probability \({\mathbb{P}}[ X_5 = 4 | X_1 = 1]\).
Compute the conditional probability \({\mathbb{P}}[ X_{20} = 3 | X_1 = 3]\).
Suppose that each of the \(4\) states is equally likely to be the initial state (i.e., the value of \(X_0\)). Compute the probability \({\mathbb{P}}[X_1 = 4]\).
Here is the transition graph of the chain:
where orange edges have probability \(1/2\), green \(1/3\) and black \(1\).
This is the probability that, starting from \(1\), we will be at \(4\) in exactly \(4\) steps. To compute it, we enumerate all possible trajectories from \(1\) to \(4\) of length \(4\), and then add their probabilities. These trajectories, and their probabilities, are given in the table below:
| trajectory | probability |
|---|---|
| \(1 {\rightarrow}1 {\rightarrow}1 {\rightarrow}3 {\rightarrow}4\) | \(\Big(\frac{1}{3}\Big)^3 \times \Big( \frac{1}{2} \Big)\) |
| \(1 {\rightarrow}1 {\rightarrow}3 {\rightarrow}3 {\rightarrow}4\) | \(\Big(\frac{1}{3}\Big)^2 \times \Big( \frac{1}{2} \Big)^2\) |
| \(1 {\rightarrow}1 {\rightarrow}3 {\rightarrow}4 {\rightarrow}4\) | \(\Big(\frac{1}{3}\Big)^2 \times \Big( \frac{1}{2} \Big) \times 1\) |
| \(1 {\rightarrow}3 {\rightarrow}3 {\rightarrow}3 {\rightarrow}4\) | \(\Big(\frac{1}{3}\Big)\times \Big( \frac{1}{2} \Big)^3\) |
| \(1 {\rightarrow}3 {\rightarrow}3 {\rightarrow}4 {\rightarrow}4\) | \(\Big(\frac{1}{3}\Big)\times \Big( \frac{1}{2} \Big)^2 \times 1\) |
| \(1 {\rightarrow}3 {\rightarrow}4 {\rightarrow}4 {\rightarrow}4\) | \(\Big(\frac{1}{3}\Big)\times \Big( \frac{1}{2} \Big) \times 1^2\) |
Therefore, the required probability is \[\begin{align} \Big(\frac{1}{3}\Big)^3 \times \Big( \frac{1}{2} \Big) + \Big(\frac{1}{3}\Big)^2 \times \Big( \frac{1}{2} \Big)^2 + \Big(\frac{1}{3}\Big)^2 \times \Big( \frac{1}{2} \Big)+ \Big(\frac{1}{3}\Big)\times \Big( \frac{1}{2} \Big)^3 + \Big(\frac{1}{3}\Big)\times \Big( \frac{1}{2} \Big)^2 + \Big(\frac{1}{3}\Big)\times \Big( \frac{1}{2} \Big). \end{align}\]
It evaluates to \(85/216\), or, approximately, \(0.394\).
Once you leave state \(3\) there is no coming back. Therefore, the only way to be there \(19\) steps later is for all \(19\) steps to be \(3 {\rightarrow}3\). The probability of the \(3 {\rightarrow}3\) transition is \(1/2\), so the required probability is \((1/2)^{19}\).
We use the law of total probability: \[\begin{align} {\mathbb{P}}[ X_1 = 4] &= {\mathbb{P}}[ X_1 = 4| X_0 = 1]\times {\mathbb{P}}[ X_0=1] + {\mathbb{P}}[ X_1 = 4| X_0 = 2]\times {\mathbb{P}}[X_0=2]\\ &+ {\mathbb{P}}[ X_1 = 4| X_0 = 3]\times {\mathbb{P}}[X_0=3] + {\mathbb{P}}[ X_1 = 4| X_0 = 4]\times {\mathbb{P}}[X_0=4]\\ &= p_{14}\times \frac{1}{4} + p_{24}\times \frac{1}{4} + p_{34}\times \frac{1}{4} + p_{44}\times \frac{1}{4} = \frac{1}{2}\times \frac{1}{4} + 1\times \frac{1}{4} = \frac{3}{8} \end{align}\]
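All three answers can be cross-checked numerically. The following is a minimal sketch; the helper `n_step` and the names `p_a`, `p_b`, `p_c` are ours and are not part of the notes.

```r
# Transition matrix of the four-state chain from the problem
P = matrix(c(1/3, 1/3, 1/3, 0,
             0,   1,   0,   0,
             0,   0,   1/2, 1/2,
             0,   0,   0,   1), byrow = TRUE, ncol = 4)

# n-step transition matrix by repeated multiplication ('n_step' is a hypothetical helper)
n_step = function(P, n) {
  Pn = diag(nrow(P))
  for (k in seq_len(n)) Pn = Pn %*% P
  Pn
}

p_a = n_step(P, 4)[1, 4]               # P[X_5 = 4 | X_1 = 1]: four steps from 1 to 4
p_b = n_step(P, 19)[3, 3]              # P[X_20 = 3 | X_1 = 3]: nineteen steps from 3 to 3
a0  = matrix(rep(1/4, 4), nrow = 1)    # uniform initial distribution
p_c = (a0 %*% P)[1, 4]                 # P[X_1 = 4]
c(p_a, p_b, p_c)
```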
In a Gambler’s ruin problem with the state space \(S=\{0,1,2,3,4\}\) and the probability \(p=1/3\) of winning in a single game, compute the \(4\)-step transition probabilities \[p^{(4)}_{2 2} = {\mathbb{P}}[ X_{n+4}=2| X_n =2] \text{ and } p^{(4)}_{2 4} = {\mathbb{P}}[ X_{n+4}=4| X_n =2].\]
There are four \(4\)-step trajectories that start in \(2\) and end in \(2\), with positive probabilities (remember, once you hit \(0\) or \(4\) you get stuck there), namely \[\begin{aligned} & 2 {\rightarrow}1 {\rightarrow}2 {\rightarrow}1 {\rightarrow}2, \quad 2 {\rightarrow}1 {\rightarrow}2 {\rightarrow}3 {\rightarrow}2, \quad \\ & 2 {\rightarrow}3 {\rightarrow}2 {\rightarrow}1 {\rightarrow}2, \quad 2 {\rightarrow}3 {\rightarrow}2 {\rightarrow}3 {\rightarrow}2.\end{aligned}\] Each has probability \((1/3)\times(2/3)\times(1/3)\times(2/3) = 4/81\) so the total probability is \(16/81\).
The (possible) trajectories that go from \(2\) to \(4\) in exactly 4 steps are \[\begin{aligned} 2 {\rightarrow}1 {\rightarrow}2 {\rightarrow}3 {\rightarrow}4, \quad 2 {\rightarrow}3 {\rightarrow}2 {\rightarrow}3 {\rightarrow}4\ \text{ and }\ 2 {\rightarrow}3 {\rightarrow}4 {\rightarrow}4 {\rightarrow}4.\end{aligned}\] Each of the first two consists of one loss and three wins, so they have the same probability, namely \((2/3)\times(1/3)\times(1/3)\times(1/3) = 2/81\). The third one has probability \((1/3)\times(1/3)\times(1)\times(1) = 9/81\), so \(p^{(4)}_{24} = 2/81 + 2/81 + 9/81 = 13/81\).
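A quick numerical cross-check of both four-step probabilities, written as a sketch. States \(0,\dots,4\) are stored in rows and columns \(1,\dots,5\) (an indexing convention chosen here, since R indices start at \(1\)).

```r
p = 1/3   # probability of winning a single game (moving up by 1)

# States 0,1,2,3,4 are stored in rows/columns 1,...,5; 0 and 4 are absorbing
P = matrix(0, nrow = 5, ncol = 5)
P[1, 1] = 1; P[5, 5] = 1
for (s in 2:4) { P[s, s + 1] = p; P[s, s - 1] = 1 - p }

# Four-step transition matrix
P4 = diag(5)
for (k in 1:4) P4 = P4 %*% P
c(p22 = P4[3, 3], p24 = P4[3, 5])   # state 2 is row 3, state 4 is column 5
```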
A car-insurance company classifies drivers in three categories: bad, neutral and good. The reclassification is done in January of each year and the probabilities for transitions between different categories are given by \[P= \begin{bmatrix} 1/2 & 1/2 & 0 \\ 1/5 & 2/5 & 2/5 \\ 1/5 & 1/5 & 3/5\end{bmatrix},\] where the first row/column corresponds to the bad category, the second to neutral and the third to good. The company started in January 1990 with 1400 drivers in each category. Estimate the number of drivers in each category in 2090. Assume that the total number of drivers does not change in time and use R for your computations.
Equal numbers of drivers in each category corresponds to the uniform initial distribution, \(a^{(0)}=(1/3,1/3,1/3)\). The distribution of drivers in 2090 is given by the distribution \(a^{(100)}\) of \(X_{100}\) which is, in turn, given by \[a^{(100)}= a^{(0)} P^{100}.\] Finally, we need to compute the number of drivers in each category, so we multiply the result by the total number of drivers, i.e., \(3 \times 1400 = 4200\):
P = matrix(
c(1/2 , 1/2 , 0,
1/5 , 2/5 , 2/5 ,
1/5 , 1/5 , 3/5),
byrow=T, ncol=3)
# a0 needs to be a row matrix
a0 = matrix(c(1/3, 1/3, 1/3), nrow=1)
P100 = diag(3) # the 3x3 identity matrix
for (i in 1:100)
P100 = P100 %*% P
(a0 %*% P100) * 4200
## [,1] [,2] [,3]
## [1,] 1200 1500 1500
Note: if you think that computing matrix powers using for loops is in poor taste, there are several R packages you can use. Have a look at this post if you are curious.
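For readers who would rather avoid both the loop over all \(100\) factors and an extra package, here is a minimal base-R sketch of matrix powers by repeated squaring. The function `mat_pow` is a hypothetical helper written for this illustration; it is not a built-in R function.

```r
# Matrix power by repeated squaring; 'mat_pow' is a hypothetical helper, not a base-R function
mat_pow = function(P, n) {
  result = diag(nrow(P))
  while (n > 0) {
    if (n %% 2 == 1) result = result %*% P   # use the current binary digit of n
    P = P %*% P                              # square for the next digit
    n = n %/% 2
  }
  result
}

P = matrix(c(1/2, 1/2, 0,
             1/5, 2/5, 2/5,
             1/5, 1/5, 3/5), byrow = TRUE, ncol = 3)
a0 = matrix(c(1/3, 1/3, 1/3), nrow = 1)
drivers = (a0 %*% mat_pow(P, 100)) * 4200
drivers
```

Repeated squaring needs on the order of \(\log_2 n\) matrix multiplications instead of \(n\), which hardly matters for a \(3\times 3\) matrix but can matter for larger chains.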
A zoologist, Dr. Gurkensaft, claims to have trained Basil the Rat so that it can avoid being shocked and find food, even in highly confusing situations. Another scientist, Dr. Hasenpfeffer does not agree. She says that Basil is stupid and cannot tell the difference between food and an electrical shocker until it gets very close to either of them.
The two decide to see who is right by performing the following experiment. Basil is put in the compartment \(3\) of a maze that looks like this:
Dr. Gurkensaft’s hypothesis is that, once in a compartment with \(k\) exits, Basil will prefer the exits that lead him closer to the food. Dr. Hasenpfeffer’s claim is that every time there are \(k\) exits from a compartment, Basil chooses each one with probability \(1/k\).
After repeating the experiment 100 times, Basil got shocked before getting to food \(52\) times and he reached food before being shocked \(48\) times.
Create a Markov chain that models this situation (draw a transition graph and mark the edges with their probabilities).
Use Monte Carlo to estimate the probability of being shocked before getting to food, under the assumption that Basil is stupid (all exits are equally likely).
Btw, who do you think is right? Whose side is the evidence (48 vs. 52) on? If you know how to perform an appropriate statistical test here, do it. If you don’t, simply state what you think.
Basil’s behavior can be modeled by a Markov Chain with states corresponding to compartments, and transitions to their adjacency. The graph of such a chain, on the state space \(S=\{1,2,3,4,5,F,S\}\) would look like this (with black = \(1\), orange = \(1/2\) and green=\(1/3\))
To be able to do Monte Carlo, we need to construct its transition matrix. Since there are far fewer transitions than pairs of states, it is a good idea to start with a matrix of \(0\)s and then fill in the non-zero values. We also decide that \(F\) and \(S\) will be given the last two rows/columns, i.e., numbers \(6\) and \(7\):
P = matrix(0,nrow =7, ncol=7 )
P[1,2] = 1/2; P[1,3] = 1/2;
P[2,1] = 1/3; P[2,4] = 1/3; P[2,6] = 1/3;
P[3,1] = 1/3; P[3,4] = 1/3; P[3,7] = 1/3;
P[4,2] = 1/3; P[4,3] = 1/3; P[4,5] = 1/3;
P[5,4] = 1/2; P[5,6] = 1/2;
P[6,6] = 1
P[7,7] = 1
We continue by simulating nsim = 1000 trajectories of this chain, starting from the state \(3\). We compress and reuse the simulation code from earlier in the notes:
T = 100 # number of time periods
nsim = 1000 # number of simulations
single_trajectory = function(i) {
path = numeric(T)
last = i
for (n in 1:T) {
path[n] = sample(1:7, prob = P[last, ], size = 1)
last = path[n]
}
return(path)
}
df = data.frame(X0 = 3, t(replicate(nsim, single_trajectory(3))))
(p_shocked = mean(df$X100 == 7))
## [1] 0.58
So, the probability of being shocked first is about \(0.58\). To be honest, what we computed here is not \({\mathbb{P}}[X_{\tau_{S,F}} = S]\), as the problem required, but the probability \({\mathbb{P}}[ X_{100} = S]\). In general, these are not the same, but because both \(S\) and \(F\) are absorbing states, the events \(X_{100}=S\) and \(X_{\tau_{S,F}} = S\) differ only on the event where \(\tau_{S,F}>100\), i.e., when Basil has been neither shocked nor fed after \(100\) steps.
To see what kind of an error we are making, we can examine the empirical distribution of \(X_{100}\) across our \(1000\) samples:
table(df$X100)
##
## 6 7
## 419 581
and conclude that, on this particular set of simulations, \(\tau_{S,F}\leq 100\), so no error has been made at all. In general, approximations like this are very useful in cases where we can expect the probability of non-absorption within a given time interval to be negligible. On the other hand, if you examine a typical trajectory in df, you will see that most of the time it takes the value \(6\) or \(7\), so a lot of the computational effort goes to waste. But don’t worry about such things in this course.
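If the wasted effort does bother you, one possible remedy is to stop each simulated trajectory as soon as it is absorbed. The sketch below does that for the same chain; the helper `run_until_absorbed`, the seed and the number of simulations are our choices, not part of the original solution.

```r
set.seed(1)   # arbitrary seed, only to make the sketch reproducible

# Same transition matrix as above: compartments 1-5, food = 6, shock = 7
P = matrix(0, nrow = 7, ncol = 7)
P[1, 2] = 1/2; P[1, 3] = 1/2
P[2, 1] = 1/3; P[2, 4] = 1/3; P[2, 6] = 1/3
P[3, 1] = 1/3; P[3, 4] = 1/3; P[3, 7] = 1/3
P[4, 2] = 1/3; P[4, 3] = 1/3; P[4, 5] = 1/3
P[5, 4] = 1/2; P[5, 6] = 1/2
P[6, 6] = 1;   P[7, 7] = 1

# Run the chain from state i until it is absorbed in 6 or 7; return the absorbing state.
# 'run_until_absorbed' is a hypothetical helper name.
run_until_absorbed = function(i) {
  state = i
  while (!(state %in% c(6, 7))) {
    state = sample(1:7, size = 1, prob = P[state, ])
  }
  state
}

nsim = 2000
final_states = replicate(nsim, run_until_absorbed(3))
(p_shocked = mean(final_states == 7))
```

Since only the absorbing state is recorded, this version estimates \({\mathbb{P}}[X_{\tau_{S,F}} = S]\) directly, with no truncation at a fixed time horizon.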
So, is this enough evidence to conclude that Basil is, in fact, a smart rat? On one hand, the obtained probability \(0.58\) is somewhat higher than Basil’s observed shock rate of \(52\%\). On the other hand, it is not clear from those numbers alone whether the difference is due to Basil’s alleged intelligence or to simple luck of the draw. Without doing any further statistical analysis, my personal guess would be “probably, but who knows”.
For those of you who know a bit of statistics: one can apply the binomial test (or, more precisely, its large-sample approximation) to test against the null hypothesis that Basil is stupid. Under the null, the number of times Basil will get shocked in 100 experiments is binomial, with parameters \(n=100\) and \(p=0.581\). Its normal approximation is \(N(np, \sqrt{np(1-p)}) = N(58.1, 4.934)\), so the \(z\)-score of the observed value, i.e., \(52\), is \(z = \tfrac{ 52 - 58.1}{ 4.934} = -1.236\). The standard normal CDF at \(z=-1.236\) is about \(0.11\), i.e., the \(p\)-value is about \(0.11\). That means that, by chance alone, a truly stupid rat would appear at least as smart as Basil in about \(11\%\) of experiments identical to the one described above. This kind of evidence is usually not considered sufficient to make a robust conclusion about Basil’s intelligence.
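The numbers in the paragraph above can be reproduced in a few lines of R. The sketch also computes the exact binomial tail for comparison, which is our addition; the variable names are ours as well.

```r
n = 100; p0 = 0.581; observed = 52

# Large-sample (normal) approximation, as in the text
z = (observed - n * p0) / sqrt(n * p0 * (1 - p0))
p_normal = pnorm(z)

# Exact one-sided binomial tail: probability of at most 52 shocks under the null
p_exact = pbinom(observed, size = n, prob = p0)

c(z = z, p_normal = p_normal, p_exact = p_exact)
```

The built-in call `binom.test(52, 100, p = 0.581, alternative = "less")` reports the same exact one-sided \(p\)-value as `pbinom` above.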
A math professor has \(4\) umbrellas. He keeps some of them at home and some in the office. Every morning, when he leaves home, he checks the weather and takes an umbrella with him if it rains. In case all the umbrellas are in the office, he gets wet. The same procedure is repeated in the afternoon when he leaves the office to go home. The professor lives in a tropical region, so the chance of rain in the afternoon is higher than in the morning; it is \(1/5\) in the afternoon and \(1/20\) in the morning. Whether it rains or not is independent of whether it rained the last time he checked.
On day \(0\), there are \(2\) umbrellas at home, and \(2\) in the office.
Construct a Markov chain that models the situation.
Use Monte Carlo to give an approximate answer to the following questions:
We model the situation by a Markov chain whose states are all of the form h\(m\)-\(n\) or \(m\)-\(n\)o, or “Wet” where h\(m\)-\(n\) means that the professor is at home, there are \(m\) umbrellas at home and \(n\) umbrellas at the office. Similarly, \(m\)-\(n\)o means that there are \(m\) umbrellas at home, \(n\) at the office and the professor is at the office.
The transitions between the states are simple to figure out. For example, from the state h\(4\)-\(0\) the professor will move to the state \(4\)-\(0\)o with probability \(19/20\) (if it does not rain) and to \(3\)-\(1\)o with probability \(1/20\) (if it rains). The professor can move to the state “Wet” only from h\(0\)-\(4\) (with probability \(1/20\)) or from \(4\)-\(0\)o (with probability \(1/5\)), and the state “Wet” is made absorbing. Here is the full transition graph, color-coded as follows: green is \(1/20\), pink is \(19/20\), orange is \(1/5\) and purple is \(4/5\):
The state “Wet” is absorbing, and, therefore, constitutes a one-element recurrent class. All the other states belong to a separate, transient, class. The period of the state “Wet” is \(1\), while the periods of all other states are \(2\).
We prepare for Monte Carlo by building the state-space vector
S and the transition matrix P for this
chain.
S= c("h0-4", "h1-3", "h2-2", "h3-1", "h4-0",
"0-4o", "1-3o", "2-2o", "3-1o", "4-0o", "Wet")
P = matrix(c(
0, 0, 0, 0, 0, 0.95, 0, 0, 0, 0, 0.05,
0, 0, 0, 0, 0, 0.05, 0.95, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0.05, 0.95, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0.05, 0.95, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0.05, 0.95, 0,
0.8, 0.2, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0.8, 0.2, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0.8, 0.2, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0.8, 0.2, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0.8, 0, 0, 0, 0, 0, 0.2,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1 ),
byrow=TRUE, ncol = 11 )
We reuse the code from the notes to simulate nsim=1000 paths of length T=300; it is almost exactly the same as before, but we include it here for completeness:
T = 300 # number of time periods
nsim = 1000 # number of simulations
single_trajectory = function(i) {
path = numeric(T)
last = i
for (n in 1:T) {
path[n] = sample(1:11, prob = P[last, ], size = 1)
last = path[n]
}
return(path)
}
i0 = 3 # the initial state is 'h2-2' which is at position 3 in S
d = data.frame(X0 = i0, t(replicate(nsim, single_trajectory(i0))))
First, we check that all \(1000\)
trajectories reached the state 11 (“Wet”) during the first
T=300 steps:
table(d$X300)
##
## 11
## 1000
Good. Next, we need to find the first time a given trajectory hits 11 (the number of the state “Wet”). This can be done in several ways but perhaps the easiest is by combining built-in functions match (which finds the first occurrence of an element in a vector) and apply (which applies a function to each row/column in a matrix/data.frame):
tau = apply(d, 1, function(x) match(11, x))
mean(tau)
## [1] 39
So we get that the expected number of trips before the professor gets wet is about \(39\), which is about \(19.5\) days.
The second question can be answered in many ways. One possibility is to split the state “Wet” into two states, depending on the location the professor left just before he got wet. A simpler possibility is to check the number of the state with index tau-1 in the data frame d above:
last_locations = integer(nsim)
for (i in 1:nsim) last_locations[i] = d[i, tau[i] - 1]
table(last_locations)
## last_locations
## 1 10
## 10 990
The states h\(0\)-\(4\) and \(4\)-\(0\)o have numbers \(1\) and \(10\) in the vector S of states. Therefore, the table obtained above tells us that just before getting wet the professor left his home in \(10\) draws out of \(1000\), and his office in the remaining \(990\) draws. Hence, the Monte-Carlo estimate of the required probability is \(0.99\).
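There is another, computationally simpler, but conceptually trickier way of obtaining this estimate. Since the “Wet” state can only be reached from the two states with numbers \(1\) and \(10\), and all states (except for “Wet”) have period \(2\), we have the following dichotomy: the professor is at home at even times and at the office at odd times, so the parity of the time at which “Wet” is first reached tells us which location he left. In terms of tau (which records a position in a row of d, with \(X_0\) in position \(1\)), an even value means that he left home (state \(1\)), and an odd value means that he left the office (state \(10\)).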
Therefore, it is enough to count the number of even and odd elements
in tau:
tau_mod_2 = tau %% 2
table(tau_mod_2)
## tau_mod_2
## 0 1
## 10 990
We get the same probability estimate as above.
There will be a lot of definitions and some theory before we get to examples. You might want to peek ahead as notions are being introduced; it will help your understanding.
Let \(\{X_n\}_{n\in {\mathbb{N}}_0}\) be a Markov chain on the state space \(S\). For a given set \(B\) of states, define the (first) hitting time \(\tau_B\) (or \(\tau(B)\) if subscripts are impractical) of the set \(B\) as \[\begin{equation} \tau_B=\min \{ n\in{\mathbb{N}}_0\, : \, X_n\in B\}. \end{equation}\] We know that \(\tau_B\) is, in fact, a stopping time with respect to \(\{X_n\}_{n\in {\mathbb{N}}_0}\). When \(B\) consists of only one element, e.g. \(B=\{i\}\), we simply write \(\tau_{i}\) for \(\tau_{\{i\}}\); \(\tau_{i}\) is the first time the Markov chain \(\{X_n\}_{n\in {\mathbb{N}}_0}\) “hits” the state \(i\). As always, we allow \(\tau_{B}\) to take the value \(\infty\); it means that no state in \(B\) is ever hit.
The hitting times are important both for applications, and for better understanding of the structure of Markov chains in general. For example, let \(\{X_n\}_{n\in {\mathbb{N}}_0}\) be the chain which models a game of tennis (from the previous lecture). The probability of winning for Player 1 can be phrased in terms of hitting times: \[{\mathbb{P}}[ \text{Player 1 wins}]={\mathbb{P}}[ \tau_{i_{1}}<\tau_{i_{2}}],\] where \(i_{1}=\) “Player 1 wins” and \(i_{2}=\)“Player 2 wins” (the two absorbing states of the chain). We will learn how to compute such probabilities in the subsequent lectures.
Having introduced the hitting times \(\tau_B\), let us give a few more definitions. It will be very convenient to consider the same Markov chain with different initial distributions. Most often, these distributions will correspond to starting from a fixed state (as opposed to choosing the initial state at random). We use the notation \({\mathbb{P}}_i[A]\) to mean \({\mathbb{P}}[A|X_0=i]\) (for any event \(A\)), and \({\mathbb{E}}_i[Y]={\mathbb{E}}[Y|X_0=i]\) (for any random variable \(Y\)). In practice, we use \({\mathbb{P}}_i\) and \({\mathbb{E}}_i\) to signify that we are starting the chain from the state \(i\), i.e., \({\mathbb{P}}_i\) corresponds to a Markov chain whose transition matrix is the same as the one of \(\{X_n\}_{n\in {\mathbb{N}}_0}\), but the initial distribution is given by \({\mathbb{P}}_i[X_0=j]=0\) if \(j\not = i\) and \({\mathbb{P}}_i[X_0=i]=1\). Note also that \({\mathbb{P}}_i[X_1=j] = p_{ij}\) and that \({\mathbb{P}}_i[X_n=j] =p^{(n)}_{ij}\), for any \(n\).
A state \(i\in S\) is said to communicate with the state \(j\in S\), denoted by \(i{\rightarrow}j\) if \[{\mathbb{P}}_i[\tau_{j}<\infty]>0.\]
Intuitively, \(i\) communicates with \(j\) if there is a non-zero chance that the Markov chain \(X\) will eventually visit \(j\) if it starts from \(i\). Sometimes we also say that \(j\) is a consequent of \(i\), that \(j\) is accessible from \(i\), or that \(j\) follows \(i\).
In the “tennis” example of the previous chapter, every state is accessible from \((0,0)\) (the fact that \(p\in (0,1)\) is important here), but \((0,0)\) is not accessible from any other state. The consequents of \((0,0)\) are not only \((15,0)\) and \((0,15)\), but also \((30,15)\) or \((40,40)\). In fact, all states are consequents of \((0,0)\). The consequents of \((40,40)\) are \((40,40)\) itself, \((40,Adv)\), \((Adv, 40)\), “P1 wins” and “P2 wins”.
Explain why \(i {\rightarrow}j\) if and only if \(p^{(n)}_{ij}>0\) for some \(n\in{\mathbb{N}}_0\).
Leaving a rigorous mathematical proof aside, we note that the statement is intuitively easy to understand. If \(i{\rightarrow}j\) then there must exist some time \(n\) such that \({\mathbb{P}}_i[\tau_j = n]>0\). This, in turn, implies that it is possible to go from \(i\) to \(j\) in exactly \(n\) steps, where “possible” means “with positive probability”. In our notation, that is exactly what \(p^{(n)}_{ij}>0\) means.
Conversely, if \(p^{(n)}_{ij}>0\) then \({\mathbb{P}}_i[ \tau_j <\infty] \geq {\mathbb{P}}_i[\tau_j \leq n] \geq {\mathbb{P}}_i[ X_n = j]=p^{(n)}_{ij}>0.\)
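For a chain with finitely many states, the criterion above is easy to check by computer. The sketch below relies on a fact not proved in these notes: in a chain with \(N\) states, if \(j\) can be reached from \(i\) at all, it can be reached in at most \(N-1\) steps (a shortest path never repeats a state). The function name `accessible` is ours, and the example matrix is the four-state chain from an earlier problem.

```r
# Logical matrix whose (i,j) entry is TRUE when p^(n)_ij > 0 for some n in 0, 1, ..., N-1.
# 'accessible' is a hypothetical helper name; it assumes a finite state space.
accessible = function(P) {
  N = nrow(P)
  step = (P > 0) * 1                     # 0/1 pattern of the one-step transitions
  power = diag(N)                        # pattern of the 0-step transitions (identity)
  reach = diag(N) > 0
  for (n in seq_len(N - 1)) {
    power = ((power %*% step) > 0) * 1   # which pairs are linked by a path of length n
    reach = reach | (power > 0)
  }
  reach
}

P = matrix(c(1/3, 1/3, 1/3, 0,
             0,   1,   0,   0,
             0,   0,   1/2, 1/2,
             0,   0,   0,   1), byrow = TRUE, ncol = 4)
R = accessible(P)
R
```

Working with the \(0/1\) pattern of \(P\) rather than with \(P\) itself avoids any issues with very small positive probabilities being rounded to \(0\).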
Two immediate properties of the relation \({\rightarrow}\) are listed in the problem below:
Explain why the following statements are true for all states \(i,j,k\) of a Markov chain.
\(i{\rightarrow}i\),
\(i{\rightarrow}j, j{\rightarrow}k\) implies \(i {\rightarrow}k\).
If we start from state \(i\in S\) we are already there! More rigorously, note that \(0\) is allowed as a value for \(\tau_{B}\) in its definition above, i.e., \(\tau_i=0\) when \(X_0=i\).
Intuitively, if you can follow a path (sequence of arrows) from \(i\) to \(j\), and then another path from \(j\) to \(k\), you can do the same from \(i\) to \(k\) by concatenating the two paths. More rigorously, by the previous problem, it will be enough to show that \(p^{(n)}_{ik}>0\) for some \(n\in{\mathbb{N}}_0\). By the same problem, we know that \(p^{(n_1)}_{ij}>0\) and \(p^{(n_2)}_{jk}>0\) for some \(n_1,n_2\in{\mathbb{N}}_0\). By the Chapman-Kolmogorov relations, with \(n=n_1+n_2\), we have \[\begin{equation} p^{(n)}_{ik} =\sum_{l\in S} p^{(n_1)}_{il} p^{(n_2)}_{lk}\geq p^{(n_1)}_{ij} p^{(n_2)}_{jk}>0. \end{equation}\] Note that the inequality \(p^{(n)}_{ik}\geq p^{(n_1)}_{il}p^{(n_2)}_{lk}\) is valid for all \(i,l,k\in S\), as long as \(n_1+n_2=n\). It will come in handy later.
Remember that the greatest common divisor (gcd) of a set \(A\) of natural numbers is the largest number \(d\in{\mathbb{N}}\) such that \(d\) divides each \(k\in A\), i.e., such that each \(k\in A\) is of the form \(k=l d\) for some \(l\in{\mathbb{N}}\).
The period \(d(i)\) of a state \(i\in S\) is the greatest common divisor of the return set \[R(i)= \{ n\in{\mathbb{N}}\, : \, p^{(n)}_{ii}>0\}\] of the state \(i\). When \(R(i)=\emptyset\), we set \(d(i)=1\). A state \(i\in S\) is called aperiodic if \(d(i)=1\).
Consider two Markov chains with three states and the transition matrices \[P_1=\begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix}, \quad P_2=\begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ \tfrac{1}{2} & 0 & \tfrac{1}{2} \end{bmatrix}\]
Find return sets and periods of each state \(i\) of each chain.
For the first chain, with transition graph
the return set for each state \(i\in\{1,2,3\}\) is given by \(R(i)= \{3,6,9,12,\dots\}\), so \(d(i)=3\) for all \(i\in\{1,2,3\}\).
Even though the transition graph of the second chain looks very similar to the first one
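the situation changes drastically: \[\begin{align} R(1) & =\{ 3,4,5,6, \dots \},\\ R(2) & =\{ 3,4,5,6, \dots \},\\ R(3) & =\{ 1,2,3,4,5,6, \dots \}, \end{align}\] so that \(d(i)=1\) for \(i\in\{1,2,3\}\). (State \(2\) cannot return to itself in \(2\) steps: from \(2\) the chain must move to \(3\), and from \(3\) it can only go to \(1\) or stay at \(3\).)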
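Return sets and periods of small chains can be explored numerically. The sketch below lists the return times up to a cutoff `n_max` and takes their gcd. All helper names are ours, and the cutoff is an assumption: the gcd of a truncated return set can overestimate the true period if `n_max` is too small, so treat the output as a check, not a proof.

```r
# Return times n <= n_max with p^(n)_ii > 0, and their gcd; helper names are hypothetical
gcd2 = function(a, b) if (b == 0) a else gcd2(b, a %% b)

return_set = function(P, i, n_max = 12) {
  Pn = diag(nrow(P))
  times = integer(0)
  for (n in 1:n_max) {
    Pn = Pn %*% P
    if (Pn[i, i] > 0) times = c(times, n)
  }
  times
}

period = function(P, i, n_max = 12) {
  R = return_set(P, i, n_max)
  if (length(R) == 0) return(1)   # convention from the definition: d(i) = 1 when R(i) is empty
  Reduce(gcd2, R)
}

P1 = matrix(c(0, 1, 0,
              0, 0, 1,
              1, 0, 0), byrow = TRUE, ncol = 3)
P2 = matrix(c(0,   1, 0,
              0,   0, 1,
              1/2, 0, 1/2), byrow = TRUE, ncol = 3)

sapply(1:3, function(i) period(P1, i))   # periods in the first chain
sapply(1:3, function(i) period(P2, i))   # periods in the second chain
return_set(P2, 2, n_max = 6)             # return times of state 2 in the second chain
```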
We say that the states \(i\) and \(j\) in \(S\) intercommunicate, denoted by \(i\leftrightarrow j\), if \(i{\rightarrow}j\) and \(j{\rightarrow}i\). A set \(B\subseteq S\) of states is called irreducible if \(i\leftrightarrow j\) for all \(i,j\in B\).
Unlike the relation of communication, the relation of intercommunication is symmetric. Moreover, we have the following immediate property: the relation \(\leftrightarrow\) is an equivalence relation on \(S\), i.e., for all \(i,j,k\in S\), we have
\(i\leftrightarrow i\) (reflexivity) ,
\(i\leftrightarrow j\) implies \(j\leftrightarrow i\) (symmetry), and
\(i\leftrightarrow j, j\leftrightarrow k\) implies \(i\leftrightarrow k\) (transitivity).
The fact that \(\leftrightarrow\) is an equivalence relation allows us to split the state-space \(S\) into equivalence classes with respect to \(\leftrightarrow\). In other words, we can write \[S=S_1\cup S_2\cup S_3\cup \dots,\] where \(S_1, S_2, \dots\) are mutually exclusive (disjoint) and all states in a particular \(S_n\) intercommunicate, while no two states from different equivalence classes \(S_n\) and \(S_m\) do. The sets \(S_1, S_2, \dots\) are called classes of the chain \(\{X_n\}_{n\in {\mathbb{N}}_0}\). Equivalently, one can say that classes are maximal irreducible sets, in the sense that they are irreducible and no class is a subset of a (strictly larger) irreducible set. A cookbook algorithm for class identification would involve the following steps:
Start from an arbitrary state (call it \(1\)).
Identify all states \(j\) that intercommunicate with it (\(1\), itself, always does).
That is your first class, call it \(C_1\). If there are no elements left, then there is only one class \(C_1=S\). If there is an element in \(S\setminus C_1\), repeat the procedure above starting from that element.
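The cookbook steps above can be sketched in R for a chain with finitely many states. The function `find_classes` is a hypothetical helper of ours; it uses the fact (not proved in these notes) that paths of length at most \(N-1\) suffice to decide accessibility in a chain with \(N\) states. The example is the Gambler’s ruin chain on \(\{0,1,2,3,4\}\) with \(p=1/3\), with states stored in rows \(1,\dots,5\).

```r
# Partition the states into classes, following the cookbook steps above.
# 'find_classes' is a hypothetical helper; it assumes a finite state space.
find_classes = function(P) {
  N = nrow(P)
  step = (P > 0) * 1
  reach = diag(N) > 0                  # reach[i, j]: is j accessible from i?
  for (n in seq_len(N - 1)) {
    reach = reach | (((reach * 1) %*% step) > 0)
  }
  inter = reach & t(reach)             # i <-> j: accessible in both directions
  classes = list()
  unassigned = 1:N
  while (length(unassigned) > 0) {
    i = unassigned[1]                  # start from an arbitrary remaining state
    cls = which(inter[i, ])            # all states that intercommunicate with it
    classes[[length(classes) + 1]] = cls
    unassigned = setdiff(unassigned, cls)
  }
  classes
}

# Gambler's ruin with states 0,...,4 in rows 1,...,5
P = matrix(0, nrow = 5, ncol = 5)
P[1, 1] = 1; P[5, 5] = 1
for (s in 2:4) { P[s, s + 1] = 1/3; P[s, s - 1] = 2/3 }
cl = find_classes(P)
cl
```

In the row numbering used here, the two absorbing states form one-element classes, and the three interior states form a single class.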
The notion of a class is especially useful in relation to another natural concept: A set \(B\subseteq S\) of states is said to be closed if \(i\not{\rightarrow}j\) for all \(i\in B\) and all \(j\in S\setminus B\). In words, \(B\) is closed if it is impossible to get out of. A state \(i\in S\) such that the set \(\{i\}\) is closed is called absorbing.
Show that a set \(B\) of states is closed if and only if \(p_{ij}=0\) for all \(i\in B\) and all \(j\in B^c=S\setminus B\).
Suppose, first, that \(B\) is closed. Then for \(i\in B\) and \(j\in B^c\), we have \(i\not{\rightarrow}j\), i.e., \(p^{(n)}_{ij}=0\) for all \(n\in{\mathbb{N}}\). In particular, \(p_{ij}=0\).
Conversely, suppose that \(p_{ij}=0\) for all \(i\in B\), \(j\in B^c\). We need to show that \(k\not{\rightarrow}l\) (i.e. \(p^{(n)}_{kl}=0\) for all \(n\in{\mathbb{N}}\)) for all \(k\in B\), \(l\in B^c\). Suppose, to the contrary, that there exist \(k\in B\) and \(l\in B^c\) such that \(p^{(n)}_{kl}>0\) for some \(n\in {\mathbb{N}}\). That means that we can find a sequence of states \[k=i_0, i_1, \dots, i_n=l \text{ such that } p_{i_{m-1} i_{m}}>0 \text{ for all }m = 1,\dots, n.\] The first state, \(k=i_0\), is in \(B\) and the last one, \(l=i_n\), is in \(B^c\). Therefore there must exist an index \(m\) such that \(i_{m-1}\in B\) but \(i_{m}\in B^c\). We also know that \(p_{i_{m-1} i_{m}}>0\), which is in contradiction with our assumption that \(p_{ij}=0\) for all \(i\in B\) and \(j\in B^c\).
Intuitively, a set of states is closed if it has the property that the chain \(\{X_n\}_{n\in {\mathbb{N}}_0}\) stays in it forever, once it enters it. In general, if \(B\) is closed, it does not have to follow that \(S\setminus B\) is closed. Also, a class does not have to be closed, and a closed set does not have to be a class. Here is an example: consider the following three sets of states in the tennis chain of the previous lecture:
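The characterization just proved turns closedness into a one-line check on the transition matrix. Here is a minimal sketch; the helper `is_closed` is ours, and the example matrix is the four-state chain from an earlier problem.

```r
# TRUE when no one-step transition leads from B to its complement.
# 'is_closed' is a hypothetical helper; B is a vector of state indices.
is_closed = function(P, B) {
  outside = setdiff(seq_len(nrow(P)), B)
  if (length(outside) == 0) return(TRUE)   # the whole state space is always closed
  all(P[B, outside, drop = FALSE] == 0)
}

P = matrix(c(1/3, 1/3, 1/3, 0,
             0,   1,   0,   0,
             0,   0,   1/2, 1/2,
             0,   0,   0,   1), byrow = TRUE, ncol = 4)

is_closed(P, 2)         # state 2 is absorbing
is_closed(P, c(3, 4))   # from 3 or 4 the chain never leaves {3, 4}
is_closed(P, 1)         # state 1 leaks to 2 and 3
is_closed(P, 3)         # state 3 leaks to 4
```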
\(B=\{\text{"P1 wins"}\}\): closed and a class, but \(S\setminus B\) is not closed
\(B=S\setminus \{(0,0)\}\): closed, but not a class, and
\(B=\{(0,0)\}\): class, but not closed.
Not everything is lost as the following relationship always holds:
Show that every closed set \(B\) is a union of one or more classes.
Let \(\hat{B}\) be the union of all classes \(C\) such that \(C\cap B\not=\emptyset\). In other words, take all the elements of \(B\) and throw in all the states which intercommunicate with at least one of them. I claim that \(\hat{B}=B\). Clearly, \(B\subseteq \hat{B}\), so we need to show that \(\hat{B}\subseteq B\). Suppose, to the contrary, that there exists \(j\in \hat{B}\setminus B\). By construction, \(j\) intercommunicates with some \(i\in B\). In particular \(i{\rightarrow}j\). By the closedness of \(B\), we must have \(j\in B\). This contradicts the assumption that \(j\in \hat{B}\setminus B\).
Note that the converse is not true: just take the set \(B=\{ (0,0), (0,15)\}\) in the “tennis” example. It is a union of two classes, but it is not closed.
It is often important to know whether a Markov chain will ever return to its initial state, and if so, how often. The notions of transience and recurrence are used to address these questions.
We start by introducing a cousin \(T_j(1)\) of the first hitting time \(\tau_j\). The (first) visit time to state \(j\), denoted by \(T_j(1)\), is defined as \[T_j(1) = \min \{ n\in{\mathbb{N}}\, : \, X_n=j\}.\] As usual \(T_j(1)=\infty\) if \(X_n\not = j\) for all \(n\in{\mathbb{N}}\). Similarly, second, third, etc., visit times are defined as follows: \[\begin{aligned} T_j(2) &= \min \{ n>T_j(1)\, : \, X_n=j\}, \\ T_j(3) &= \min \{ n>T_j(2)\, : \, X_n=j\}, \text{ etc., }\end{aligned}\] with the understanding that if \(T_j(n)=\infty\), then also \(T_j(m)=\infty\) for all \(m>n\).
Note that the definition of the random variable \(T_j(1)\) differs from the definition of \(\tau_j\) in that the minimum here is taken over the set \({\mathbb{N}}\) of natural numbers, while the set of non-negative integers \({\mathbb{N}}_0\) is used for \(\tau_j\). When \(X_0\not = j\), the hitting time \(\tau_j\) and the first visit time \(T_j(1)\) coincide. The important difference occurs only when \(X_0=j\). In that case \(\tau_j=0\) (we are already there), but it is always true that \(T_j(1)\geq 1\). It can even happen that \({\mathbb{P}}_j[T_j(1)=\infty]=1\). If you want an example, take any state in the deterministically monotone chain.
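The difference between \(\tau_j\) and \(T_j(1)\) is easy to see on a concrete trajectory. In the sketch below a path is stored as a vector whose first entry is \(X_0\); both helper names and the sample path are made up for illustration. On a finite path, the value `Inf` only means that \(j\) was not seen within the observed window.

```r
# path[1] holds X_0, path[2] holds X_1, etc.; helper names are hypothetical
hitting_time = function(path, j) {
  k = match(j, path)                 # first position (time 0 included) where the path equals j
  if (is.na(k)) Inf else k - 1
}
first_visit_time = function(path, j) {
  k = match(j, path[-1])             # ignore X_0: the minimum is over n >= 1
  if (is.na(k)) Inf else k
}

path = c(2, 1, 3, 2, 2)              # an illustrative trajectory X_0, ..., X_4
c(hitting_time(path, 2), first_visit_time(path, 2))   # tau_2 = 0, T_2(1) = 3
c(hitting_time(path, 3), first_visit_time(path, 3))   # both equal 2, since X_0 != 3
```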
A state \(i\in S\) is said to be
recurrent if \({\mathbb{P}}_i[T_i(1)<\infty]=1\),
positive recurrent if \({\mathbb{E}}_i[T_i(1)]<\infty\)
null recurrent if it is recurrent, but not positive recurrent,
transient if it is not recurrent.
A state is recurrent if we are sure we will come back to it eventually (with probability 1). It is positive recurrent if it is recurrent and the time between two consecutive visits has finite expectation. Null recurrence means that we will return, but the expected waiting time is infinite. A state is transient if there is a positive chance (however small) that the chain will never return to it.
The definition of recurrence from above is conceptually simple, but it gives us no clue about how to actually go about deciding whether a particular state in a specific Markov chain is recurrent. A criterion stated entirely in terms of the transition matrix \(P\) would be nice. Before we give it, we need to introduce some notation and prove an important theorem. Given a state \(i\), let \(f_i\) denote the probability that the chain will visit \(i\) again, if it starts there, i.e., \[f_i = {\mathbb{P}}_i[ T_i(1) < \infty].\] Clearly, \(i\) is recurrent if and only if \(f_i=1\).
The interesting thing is that every time our chain visits the state \(i\), its future evolution is independent of the past (except for the name of the current state) and it behaves exactly like a new and independent chain started from \(i\) would. This is a special case of the so-called strong Markov property, which states that the (usual) Markov property also holds at stopping times (and not only at fixed times \(n\)). We will not prove this property in these notes, but we will gladly use it to prove the following dichotomy:
Let \(\{X_n\}_{n\in {\mathbb{N}}_0}\) be a Markov chain on a countable state space \(S\), with the (deterministic) initial state \(X_0=i\). Then exactly one of the following two statements holds with probability 1:
either the chain will return to \(i\) infinitely many times, or
the chain will return to \(i\) a finite number \(N_i\) of times, where \(N_i\) is a geometrically distributed random variable with parameter \(f_i\), where \(f_i={\mathbb{P}}_i[T_i(1)<\infty]\).
In the first case, \(i\) is recurrent and, in the second, it is transient.
If \(f_i=1\), then \(X\) is guaranteed to return to \(i\) at least once. When that happens, however, the strong Markov property “deletes” the past, and the process “renews” itself. This puts us back in the original situation where we are looking at a chain which starts at \(i\) and is guaranteed to return there at least once. Continuing like that, we get a whole infinite sequence of stopping times \[T_i(1) < T_i(2) < \dots\] at which \(X\) finds itself at \(i\).
If \(f_i<1\), a similar story can be told, but with a significant difference. Every time \(X\) returns to \(i\), there is a probability \(1-f_i\) that it will never come back to \(i\), and this is independent of the past behavior. If we think of the return to \(i\) as a success, the number of successes before the first failure, i.e., the number of return visits to \(i\), is nothing but a geometrically distributed random variable with parameter \(f_i\).
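The geometric picture in the proof can be checked by simulation. The sketch below uses a small three-state chain that is purely hypothetical (it does not appear elsewhere in these notes): states 1 and 2 are transient and state 3 is absorbing, so that \(f_1 = 0.5 + 0.25\cdot 0.5 = 0.625\). The function name `count_returns`, the seed and the number of simulated paths are our own choices.

```r
# Hypothetical 3-state chain: states 1 and 2 are transient, state 3 is absorbing.
P <- matrix(c(0.50, 0.25, 0.25,
              0.50, 0.00, 0.50,
              0.00, 0.00, 1.00), nrow = 3, byrow = TRUE)

# Count the returns to the starting state along one path, run until absorption.
count_returns <- function(P, start = 1, absorbing = 3) {
  state <- start
  returns <- 0
  while (state != absorbing) {
    state <- sample.int(nrow(P), size = 1, prob = P[state, ])
    if (state == start) returns <- returns + 1
  }
  returns
}

set.seed(362)
n_returns <- replicate(20000, count_returns(P))

# Return probability f_1: come back directly, or via state 2.
f1 <- P[1, 1] + P[1, 2] * P[2, 1]

# The simulated values should be close to the geometric ones.
c(simulated_mean = mean(n_returns), geometric_mean = f1 / (1 - f1))
c(simulated_p0 = mean(n_returns == 0), geometric_p0 = 1 - f1)
```

The two pairs of numbers should agree up to simulation error, which is what the "number of successes before the first failure" description predicts.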
The following interesting fact follows (almost) directly from the Return Theorem:
Suppose that the state space \(S\) is finite. Show that there exists at least one recurrent state.
We argue by contradiction and assume that all the states are transient. We claim that, in that case, the total number of visits \(N_i\) to each state \(i\) is always finite, no matter what state \(i_0\) we start from. Indeed, if \(i=i_0\), that is precisely the conclusion of the Return Theorem above. For a state \(i\ne i_0\), the number of visits is either \(0\) - if we never even get to \(i\), or \(1\) plus the (finite) number of subsequent returns to \(i\) - if we do. In either case, it is a finite number (not \(\infty\)).
Since \(S\) is finite, it follows that the sum \(\sum_{i\in S} N_i\) is also finite - a contradiction with the fact that there are infinitely many time instances \(n\in{\mathbb{N}}_0\), and the fact that the chain must be in some state in each one of them.
If \(S\) is not finite, it is not true that recurrent states must exist. Just think of the Deterministically-Monotone Chain or the random walk with \(p\not=\tfrac{1}{2}\). All states are transient there.
Perhaps the most important consequence of the Return Theorem is the following criterion for recurrence of Markov chains on finite or countable state spaces:
A state \(i\in S\) is recurrent if and only if \[\sum_{n\in{\mathbb{N}}} p^{(n)}_{ii}=\infty.\]
Let \(N_i\) denote the total number (finite or \(\infty\)) of visits to the state \(i\), with the initial visit at time \(0\) not counted. We can write \(N_i\) as an infinite sum as follows \[N_i = \sum_{n=1}^{\infty} \mathbf{1}_{\{X_n = i\}}.\] Taking the expectation yields \[{\mathbb{E}}_i[N_i] = {\mathbb{E}}_i[ \sum_{n=1}^{\infty} \mathbf{1}_{\{X_n=i\}}] = \sum_{n=1}^{\infty} {\mathbb{E}}_i[ \mathbf{1}_{\{X_n=i\}}] = \sum_{n=1}^{\infty} {\mathbb{P}}_i[ X_n=i] = \sum_{n=1}^{\infty} p^{(n)}_{ii},\] where we used the intuitively acceptable (but not rigorously proven) fact that \({\mathbb{E}}_i\) and an infinite sum can be switched.
If \(i\) is transient, i.e., if \(f_i<1\), the Return Theorem and the formula for the expected value of a geometric distribution imply that \[{\mathbb{E}}_i[N_i] = \frac{f_i}{1-f_i}<\infty, \text{ and so } \sum_{n=1}^{\infty} p^{(n)}_{ii} = {\mathbb{E}}_i[N_i]<\infty.\] On the other hand, if \(i\) is recurrent, the Return Theorem states that \(N_i=\infty\). Hence, \[\sum_{n=1}^{\infty} p^{(n)}_{ii}={\mathbb{E}}_i[N_i]=\infty,\] which is exactly what we had to prove.
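Here is a quick numerical illustration of the identity \(\sum_n p^{(n)}_{ii} = f_i/(1-f_i)\) for a transient state. The three-state chain is a made-up example (states 1 and 2 transient, state 3 absorbing), and the cutoff of 200 terms is an arbitrary truncation of the infinite sum.

```r
# Hypothetical chain: states 1 and 2 are transient, state 3 is absorbing.
P <- matrix(c(0.50, 0.25, 0.25,
              0.50, 0.00, 0.50,
              0.00, 0.00, 1.00), nrow = 3, byrow = TRUE)

# Return probability to state 1: directly, or through state 2.
f1 <- P[1, 1] + P[1, 2] * P[2, 1]
expected_returns <- f1 / (1 - f1)      # mean of the geometric distribution

# Partial sum of p^{(n)}_{11} for n = 1, ..., 200.
Pn <- diag(3)
partial_sum <- 0
for (n in 1:200) {
  Pn <- Pn %*% P                       # Pn now holds the n-step matrix P^n
  partial_sum <- partial_sum + Pn[1, 1]
}

c(partial_sum = partial_sum, expected_returns = expected_returns)
```

Both numbers should agree to many decimal places, because the terms \(p^{(n)}_{11}\) decay geometrically in this example.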
Remark. The central idea behind the proof of the recurrence criterion is the following: we managed to tell whether or not \(N_i = \infty\) by checking whether \({\mathbb{E}}[N_i]=\infty\) or not. This is, however, not something that can be done for any old random variable taking values in \({\mathbb{N}}_0 \cup \{\infty\}\). If \({\mathbb{E}}[N]<\infty\), then, clearly \({\mathbb{P}}[N=\infty]=0\) so that \(N\) only takes values in \({\mathbb{N}}_0\). On the other hand, it is not true that \({\mathbb{P}}[N=\infty]=0\) implies that \({\mathbb{E}}[N]<\infty\). It suffices to take a random variable with the following distribution \[{\mathbb{P}}[ N = n] = c/n^2 \text{ for }n\in{\mathbb{N}},\] where the constant \(c\) is chosen so that \(\sum_n c/n^2 =1\) (in fact, we can compute that \(c=6/\pi^2\) explicitly in this case). The expected value of \(N\) is given by \[{\mathbb{E}}[N] = \sum_{n=1}^{\infty} n {\mathbb{P}}[N=n] = c \sum_{n=1}^{\infty} \frac{1}{n} = \infty.\] The message is that, in general, you cannot detect whether something happened infinitely many times or not based only on its expectation.
Such a detection, however, becomes possible in the special case when \(N=N_i\) denotes the total number of returns to the state \(i\) of a Markov chain. This is exactly the content of the proof of the Return Theorem above: each time the chain leaves \(i\), it comes back to it (or does not) with the same probability, independently of the past. This gives us extra information about the random variable \(N\) (namely that it is either infinite with probability \(1\) or geometrically distributed) and allows us to test its finiteness by using the expected value only.
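The counterexample in the remark is easy to explore numerically. In the sketch below (variable names and cutoffs are our own), the probabilities \(c/n^2\) add up to essentially \(1\), while the truncated expectation keeps growing by roughly the same amount each time the cutoff is multiplied by \(10\), as a harmonic sum should.

```r
c_norm <- 6 / pi^2                 # makes the probabilities c / n^2 sum to 1
cutoffs <- 10^(1:6)                # truncation levels K

# Total probability and truncated expectation, summing over n = 1, ..., K.
total_mass <- sapply(cutoffs, function(K) sum(c_norm / (1:K)^2))
truncated_mean <- sapply(cutoffs, function(K) sum(c_norm / (1:K)))

cbind(cutoffs, total_mass, truncated_mean)
```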
Here is an application of our recurrence criterion - a beautiful and unexpected result of George Pólya from 1921.
In addition to the simple symmetric random walk on the line (\(d=1\)) we studied before, one can consider random walks whose values are in the plane (\(d=2\)), the space (\(d=3\)), etc. They are defined as Markov chains with the state space \(S={\mathbb{Z}}^d\) and the following transitions: starting from the state \((x_1,\dots, x_d)\), the walk picks one of its \(2d\) neighbors \[\begin{aligned} & (x_1+1,x_2, \dots, x_d), (x_1-1,x_2, \dots, x_d),\\ &(x_1, x_2+1,\dots, x_d), (x_1, x_2-1,\dots, x_d),\\ &\dots \\ &(x_1,x_2, \dots, x_d+1), (x_1,x_2, \dots, x_d-1)\end{aligned}\] randomly and uniformly and moves there. For illustration, here is a picture of a path of a two-dimensional random walk; as time progresses, the color of the edges goes from black to orange, edges traversed multiple times are darker, dots mark the position of the walk at time \(n=0\) (the black round dot) and at time \(n=1000\) (orange square dot):
Pólya’s (and our) goal was to study the recurrence properties of the \(d\)-dimensional random walk. We already know that the simple symmetric random walk on \({\mathbb{Z}}\) is recurrent (i.e., every \(i\in {\mathbb{Z}}\) is a recurrent state). The easiest way to proceed when \(d\geq 2\) is to use the recurrence criterion we proved above. We start by estimating the values \(p^{(n)}_{ii}\), for \(n\in{\mathbb{N}}\). By symmetry, we can focus on the origin, i.e., it is enough to estimate, for each \(n\in{\mathbb{N}}\), the magnitude of \[p^{(n)}= p^{(n)}_{00}= {\mathbb{P}}_{0}[ X_n=(0,0,\dots, 0)].\] As we learned some time ago, this probability can be computed by counting all “trajectories” from \((0,\dots, 0)\) that return to \((0,\dots, 0)\) in \(n\) steps. First of all, it is clear that \(n\) needs to be even, i.e., \(n=2m\), for some \(m\in{\mathbb{N}}\). It helps if we think of any trajectory as a sequence of “increments” \(\xi_1,\dots, \xi_n,\) where each \(\xi_i\) takes its value in the set \(\{1,-1,2,-2,\dots, d, -d\}\). In words, \(\xi_i= +k\) if the \(k\)-th coordinate increases by \(1\) on the \(i\)-th step, and \(\xi_i=-k\), if the \(k\)-th coordinate decreases by \(1\).
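A path of the \(d\)-dimensional walk can be simulated in a few lines of R. The following is a minimal sketch, not the code used to produce the picture; the function name `simulate_walk` and the seed are our own. At each step it picks a coordinate uniformly at random and moves it up or down by \(1\) with equal probability, which is the same as choosing one of the \(2d\) neighbors uniformly.

```r
# Simulate n_steps steps of the simple symmetric random walk on Z^d.
# Returns an (n_steps + 1) x d matrix whose rows are the positions X_0, X_1, ...
simulate_walk <- function(n_steps, d) {
  coordinate <- sample.int(d, n_steps, replace = TRUE)      # which coordinate moves
  direction <- sample(c(-1, 1), n_steps, replace = TRUE)    # down or up
  steps <- matrix(0, nrow = n_steps, ncol = d)
  steps[cbind(seq_len(n_steps), coordinate)] <- direction   # one nonzero entry per row
  rbind(rep(0, d), apply(steps, 2, cumsum))                 # start at the origin
}

set.seed(1921)
path <- simulate_walk(1000, d = 2)
head(path)

# Times (in steps) at which the walk is back at the origin.
which(rowSums(abs(path)) == 0) - 1
```

Running `plot(path, type = "l")` draws the simulated path in the plane.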
This way, the problem becomes combinatorial:
In how many ways can we put one element of the set \(\{1,-1,2,-2, \dots, d,-d\}\) into each of \(n=2m\) boxes so that the number of boxes with \(k\) in them equals to the number of boxes with \(-k\) in them?
To get the answer, we start by fixing a possible “count” \((i_1,\dots, i_d)\), satisfying \(i_1+\dots+i_d=m\), of the number of times each of the values in \(\{1,2,\dots, d\}\) occurs. These values have to be placed in \(m\) of the \(2m\) slots and their negatives (possibly in a different order) in the remaining \(m\) slots. So, first, we choose the “positive” slots (in \(\binom{2m}{m}\) ways), and then distribute \(i_1\) “ones”, \(i_2\) “twos”, etc., in those slots; this can be done in \[\binom{ m }{ i_1 i_2 \dots i_d}\] ways. This is also the number of ways we can distribute the negative “ones”, “twos”, etc., in the remaining slots. All in all, for fixed \(i_1,i_2,\dots, i_d\), all of this can be done in \[\binom{2m}{m} \binom{ m }{ i_1 i_2 \dots i_d}^2\] ways. Remembering that each path has the probability \((2d)^{-2m}\), and summing over all \(i_1,\dots, i_d\) with \(i_1+\dots+i_d=m\), we get \[p^{(2m)} = \frac{1}{(2d)^{2m}} \binom{2m}{m} \sum_{i_1+\dots+i_d=m} \binom{ m }{ i_1 i_2 \dots i_d}^2.\] This expression looks so complicated that we had better start examining it for particular values of \(d\):
For \(d=1\), the expression above simplifies to \(p^{(2m)} = \frac{1}{4^{m}} \binom{2m}{m}\). It is still too complicated to sum over all \(m\in{\mathbb{N}}\), but we can simplify it further by using Stirling’s formula \[n! \sim \sqrt{2\pi n} \big(\tfrac{n}{e}\big)^n,\] where \(a_n \sim b_n\) means \(\lim_{n{\rightarrow}\infty} a_n/b_n=1\). Indeed, from there, \[\binom{2m}{m} \sim \frac{4^m}{ \sqrt{\pi m}}, \text{ and so } p^{(2m)} \sim \frac{1}{\sqrt{m\pi}}.\] That means that \(p^{(2m)}\) behaves like the terms of a \(p\)-series with \(p=1/2\), which we know is divergent. Therefore, \[\sum_{m=1}^{\infty} p^{(2m)} = \infty,\] and we recover our previous conclusion that the simple symmetric random walk is, indeed, recurrent.
Moving on to the case \(d= 2\), we notice that the sum of the squared multinomial coefficients in the expression for \(p^{(2m)}\) above no longer equals \(1\); in fact it is given by \[\sum_{i=0}^{m} \binom{m}{i}^2 = \binom{2m}{m},\] and, so, \[p^{(2m)} = \frac{1}{16^m} \binom{2m}{m}^2 \sim \frac{1}{16^m} \Big( \frac{4^m}{\sqrt{\pi m}} \Big)^2 = \frac{1}{\pi m}, \text{ implying that } \sum_{m=1}^{\infty} p^{(2m)}=\infty,\] which, in turn, implies that the two-dimensional random walk is also recurrent.
How about \(d\geq 3\)? Things are even more complicated now. The multinomial sum in the expression for \(p^{(2m)}\) above does not admit a nice closed-form expression as in the case \(d=2\), so we need to do some estimates; these are a bit tedious so we skip them, but report the punchline for \(d=3\), which is that \[p^{(2m)} \sim C \Big( \tfrac{3}{m} \Big)^{3/2},\] for some constant \(C\). This is where it gets interesting: this is a \(p\)-series which converges: \[\sum_{m=1}^{\infty} p^{(2m)}<\infty,\] and, so, the random walk is transient for \(d=3\). This is enough to conclude that the random walk is transient for all \(d\geq 3\), too (why?).
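The formulas above can be evaluated exactly for moderate \(m\), which gives a feeling for the difference between \(d\le 2\) and \(d=3\). The sketch below is only a numerical illustration, not a proof; the function names `p_1d`, `p_2d`, `p_3d` and the truncation level `M` are our own. Everything is computed on the log scale (`lchoose`, `lfactorial`) to avoid overflow.

```r
# Exact return probabilities p^{(2m)} to the origin for d = 1, 2, 3.
p_1d <- function(m) exp(lchoose(2 * m, m) - m * log(4))
p_2d <- function(m) p_1d(m)^2            # uses sum_i choose(m, i)^2 = choose(2m, m)
p_3d <- function(m) {
  total <- 0                             # sum of squared multinomial coefficients
  for (i in 0:m) {
    for (j in 0:(m - i)) {
      k <- m - i - j
      log_multinomial <- lfactorial(m) - lfactorial(i) - lfactorial(j) - lfactorial(k)
      total <- total + exp(2 * log_multinomial)
    }
  }
  exp(lchoose(2 * m, m) - 2 * m * log(6)) * total
}

# Check of the identity used for d = 2, with m = 10.
c(sum(choose(10, 0:10)^2), choose(20, 10))

# Check of the asymptotics: both ratios should be close to 1 for large m.
m <- 1000
c(d1 = p_1d(m) * sqrt(pi * m), d2 = p_2d(m) * pi * m)

# Partial sums of p^{(2m)} over m = 1, ..., M.
M <- 50
partial_sums <- c(d1 = sum(sapply(1:M, p_1d)),
                  d2 = sum(sapply(1:M, p_2d)),
                  d3 = sum(sapply(1:M, p_3d)))
partial_sums
```

Increasing `M` makes the partial sums for \(d=1\) and \(d=2\) grow without bound (the second one very slowly), while the one for \(d=3\) levels off.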
To summarize
The simple symmetric random walk is recurrent for \(d=1,2\), but transient for \(d\geq 3\).
In the words of Shizuo Kakutani
A drunk man will find his way home, but a drunk bird may get lost forever.
Certain properties of states are shared between all elements in a class. Knowing which properties have this feature is useful for a simple reason - if you can check them for a single class member, you know automatically that all the other elements of the class share it.
A property is called a class property if it holds for all states in a class whenever it holds for any one particular state in that class.
Put differently, a property is a class property if and only if either all states in a class have it or none does.
Show that transience and recurrence are class properties.
We use the recurrence criterion proved above.
Suppose that the state \(i\) is recurrent, and that \(j\) is in its class, i.e., that \(i\leftrightarrow j\). Then, there exist natural numbers \(m\) and \(k\) such that \(p^{(m)}_{ij}>0\) and \(p^{(k)}_{ji}>0\). By the Chapman-Kolmogorov relations, for each \(n\in{\mathbb{N}}\), we have \[p^{(n+m+k)}_{jj} =\sum_{l_1\in S} \sum_{l_2\in S} p^{(k)}_{j l_1} p^{(n)}_{l_1 l_2} p^{(m)}_{l_2 j}\geq p^{(k)}_{ji} p^{(n)}_{ii} p^{(m)}_{ij}.\] In other words, there exists a positive constant \(c\) (take \(c=p^{(k)}_{ji}p^{(m)}_{ij}\)), independent of \(n\), such that \[p^{(n+m+k)}_{jj}\geq c p^{(n)}_{ii}.\] The recurrence of \(i\) implies that \(\sum_{n=1}^{\infty}p^{(n)}_{ii}=\infty\), and so \[\sum_{n=1}^{\infty} p^{(n)}_{jj}\geq \sum_{n=m+k+1}^{\infty} p^{(n)}_{jj}= \sum_{n=1}^{\infty} p^{(n+m+k)}_{jj}\geq c \sum_{n=1}^{\infty} p^{(n)}_{ii}=\infty,\] which implies that \(j\) is recurrent. Thus, recurrence is a class property, and since transience is just the opposite of recurrence, transience is a class property, too.
Show that period is a class property, i.e., all elements of a class have the same period.
Let \(d=d(i)\) be the period of the state \(i\), and let \(j\leftrightarrow i\). Then, there exist natural numbers \(m\) and \(k\) such that \(p^{(m)}_{ij}>0\) and \(p^{(k)}_{ji}>0\). By Chapman-Kolmogorov, \[p^{(m+k)}_{ii}\geq p^{(m)}_{ij}p^{(k)}_{ji}>0,\] and so \(m+k\in R(i)\). Similarly, for any \(n\in R(j)\), \[p^{(m+k+n)}_{ii}\geq p^{(m)}_{ij} p^{(n)}_{jj} p^{(k)}_{ji}>0,\] so \(m+k+n\in R(i)\). By the definition of the period, we see now that \(d(i)\) divides both \(m+k\) and \(m+k+n\), and, so, it divides \(n\). This works for each \(n\in R(j)\), so \(d(i)\) is a common divisor of all elements of \(R(j)\); this, in turn, implies that \(d(i)\leq d(j)\). The same argument with roles of \(i\) and \(j\) switched shows that \(d(j)\leq d(i)\). Therefore, \(d(i)=d(j)\).
Now that we know that transience and recurrence are class properties, we can introduce the notion of the canonical decomposition of a Markov chain. Let \(S_1,S_2,\dots\) be the collection of all classes; some of them contain recurrent states and some transient ones. We learned in the previous section that if there is one recurrent state in a class, then all states in the class must be recurrent. Thus, it makes sense to call the whole class recurrent. Similarly, the classes which are not recurrent consist entirely of transient states, so we call them transient. There are at most countably many states, so the number of all classes is also at most countable. In particular, there are only countably (or finitely) many recurrent classes, and we usually denote them by \(C_1, C_2, \dots\). Transient classes are denoted by \(T_1,T_2, \dots\). There is no special rule for the choice of indices \(1,2,3,\dots\) for particular classes. The only point is that they can be enumerated because there are at most countably many of them.
The distinction between different transient classes is usually not very important, so we pack all transient states together in a set \(T=T_1\cup T_2\cup \dots\).
Let \(S\) be the state space of a Markov chain \(\{X_n\}_{n\in {\mathbb{N}}_0}\). Let \(C_1,C_2, \dots\) be its recurrent classes, \(T_1,T_2,\dots\) the transient classes, and let \(T=T_1\cup T_2\cup \dots\) be their union. The decomposition \[S= T \cup C_1 \cup C_2 \cup C_3 \cup \dots,\] is called the canonical decomposition of the (state space of the) Markov chain \(\{X_n\}_{n\in {\mathbb{N}}_0}\).
The reason that recurrent classes are important is simple - they can be interpreted as Markov chains themselves. To see why, we start with the following problem:
Show that recurrent classes are necessarily closed.
We argue by contradiction and assume that \(C\) is a recurrent class which is not closed. Then, there exist states \(i\in C\) and \(j\in C^c\) such that \(i{\rightarrow}j\). On the other hand, since \(j\not\in C\) and \(C\) is a class, we cannot have \(j{\rightarrow}i\). Started at \(i\), the chain will reach \(j\) with positive probability, and, since \(j\not{\rightarrow}i\), never return. That implies that the number of visits to \(i\) will be finite, with positive probability. That is in contradiction with the fact that \(i\) is recurrent and the statement of the Return Theorem above.
The fact we just proved implies the following nice dichotomy, valid for every finite-state-space chain:
A class of a Markov chain on a finite state space is recurrent if and only if it is closed.
We know that recurrent classes are closed. In order to show the converse, we need to prove that transient classes are not closed. Suppose, to the contrary, that there exists a finite-state-space Markov chain with a closed transient class \(T\). Since \(T\) is closed, we can see it as the state space of the restricted Markov chain. This new Markov chain has a finite number of states, so there exists a recurrent state. This is a contradiction with the assumption that \(T\) consists only of transient states.
The condition of finiteness is necessary for the above equivalence to hold. For a random walk on \(\mathbb Z\), all states intercommunicate. In particular, there is only one class - \(\mathbb Z\) itself - and it is trivially closed. If \(p\not=\tfrac{1}{2}\), however, all states are transient, and, so, \(\mathbb Z\) is a closed and transient class.
Together with the canonical decomposition, we introduce the canonical form of the transition matrix \(P\). The idea is to order the states in \(S\) with the canonical decomposition in mind. We start from all the states in \(C_1\), followed by all the states in \(C_2\), etc. Finally, we include all the states in \(T\). The resulting matrix looks like this \[P= \begin{bmatrix} P_1 & 0 & 0 & \dots & 0 \\ 0 & P_2 & 0 & \dots & 0 \\ 0 & 0 & P_3 & \dots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ Q_1 & Q_2 & Q_3 & \dots & \dots \end{bmatrix},\] where the entries should be interpreted as matrices: \(P_1\) is the transition matrix within the first class, i.e., \(P_1=(p_{ij},i\in C_1, j\in C_1)\), etc. \(Q_k\) contains the transition probabilities from the transient states to the states in the (recurrent) class \(C_k\). We learned, above, that recurrent classes are closed, which implies that each \(P_k\) is a stochastic matrix, or, equivalently, that all the entries in the rows of \(P_k\) outside of \(P_k\) are zeros.
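For a finite chain, the canonical decomposition and the canonical form can be computed mechanically from the pattern of positive entries of \(P\). The following is a minimal sketch under the assumption that the state space is finite (so that a class is recurrent exactly when it is closed); the function name `classify_states` is our own, and the example is a gambler's-ruin-type chain on \(\{0,1,2,3,4\}\) with \(p=\tfrac{1}{2}\), stored with indices \(1,\dots,5\).

```r
# Classes and recurrence for a FINITE chain, from the transition matrix P.
classify_states <- function(P) {
  n <- nrow(P)
  # reach[i, j] is TRUE if j is accessible from i in zero or more steps
  reach <- (P > 0) | (diag(n) == 1)
  for (k in seq_len(n)) {                       # Warshall's transitive closure
    reach <- reach | outer(reach[, k], reach[k, ], "&")
  }
  comm <- reach & t(reach)                      # i <-> j
  class_id <- apply(comm, 1, function(row) min(which(row)))  # label = smallest member
  # a class is closed if everything reachable from i leads back to i
  closed <- sapply(seq_len(n), function(i) all(comm[i, reach[i, ]]))
  data.frame(state = seq_len(n), class_id = class_id, recurrent = closed)
}

# Example: absorbing barriers at indices 1 and 5, fair steps in between.
P <- matrix(0, 5, 5)
P[1, 1] <- 1
P[5, 5] <- 1
for (i in 2:4) {
  P[i, i + 1] <- 0.5
  P[i, i - 1] <- 0.5
}

res <- classify_states(P)
res

# Canonical form: recurrent classes first, transient states last.
ord <- order(!res$recurrent, res$class_id)
P_canonical <- P[ord, ord]
P_canonical
```

In `P_canonical` the block in the upper right corner (recurrent rows, transient columns) consists of zeros, as the closedness of recurrent classes requires.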
To help you internalize the notions introduced in this chapter, we classify the states, identify closed sets and discuss periodicity, transience and recurrence in some of the standard examples. In all examples below we assume that \(0 < p < 1\).
Communication and classes. Clearly, it is possible to go from any state \(i\) to either \(i+1\) or \(i-1\) in one step, so \(i{\rightarrow}i+1\) and \(i{\rightarrow}i-1\) for all \(i\in S\). By transitivity of communication, we have \(i{\rightarrow}i+1{\rightarrow}i+2{\rightarrow}\dots{\rightarrow}i+k\). Similarly, \(i{\rightarrow}i-k\) for any \(k\in{\mathbb{N}}\). Therefore, \(i{\rightarrow}j\) for all \(i,j\in S\), and so, \(i\leftrightarrow j\) for all \(i,j\in S\), and the whole \(S\) is one big class.
Closed sets. The only closed set is \(S\) itself.
Transience and recurrence. We studied transience and recurrence in the lectures about random walks (we just did not call them that). The situation highly depends on the probability \(p\) of making an up-step. If \(p>\tfrac{1}{2}\), there is a positive probability that the first step will be “up”, so that \(X_1=1\). Then, we know that there is a positive probability that the walk will never hit \(0\) again. Therefore, there is a positive probability of never returning to \(0\), which means that the state \(0\) is transient. A similar argument can be made for any state \(i\) and any probability \(p\not=\tfrac{1}{2}\). What happens when \(p=\tfrac{1}{2}\)? In order to come back to \(0\), the walk needs to return there from its position at time \(n=1\). If it went up, then we have to wait for the walk to hit \(0\) starting from \(1\). We have shown that this will happen sooner or later, but that the expected time it takes is infinite. The same argument works if \(X_1=-1\). All in all, \(0\) (and all other states) are null-recurrent (recurrent, but not positive recurrent).
Periodicity. Starting from any state \(i\in S\), we can return to it after \(2,4,6,\dots\) steps. Therefore, the return set \(R(i)\) is always given by \(R(i)=\{2,4,6,\dots\}\) and so \(d(i)=2\) for all \(i\in S\).
Communication and classes. The winning state \(a\) and the losing state \(0\) are clearly absorbing, and form one-element classes. The other \(a-1\) states intercommunicate among each other, so they form a class of their own. This class is not closed (you can - and will - exit it and get absorbed sooner or later).
Transience and recurrence. The absorbing states \(0\) and \(a\) are (trivially) positive recurrent. All the other states are transient: starting from any state \(i\in\{1,2,\dots, a-1\}\), there is a positive probability (equal to \(p^{a-i}\)) of winning every one of the next \(a-i\) games and, thus, getting absorbed in \(a\) before returning to \(i\).
Periodicity. The absorbing states have period \(1\) since \(R(0)=R(a)={\mathbb{N}}\). The other states have period \(2\) (just like in the case of a random walk).
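Periods can also be found by brute force: record the times \(n\) at which a return to \(i\) is possible and take their greatest common divisor. The sketch below does this for a gambler's-ruin-type chain with \(a=4\) (states \(0,\dots,4\) stored as indices \(1,\dots,5\)). The helper names `gcd` and `period` are ours, only return times up to the cutoff `n_max` are scanned, and an empty return set is assigned period \(1\), following the convention used in these notes.

```r
gcd <- function(a, b) if (b == 0) a else gcd(b, a %% b)

# Period of state i: gcd of the return times n <= n_max with p^{(n)}_{ii} > 0.
period <- function(P, i, n_max = 3 * nrow(P)) {
  A <- (P > 0) * 1                  # 0/1 pattern of possible one-step moves
  An <- diag(nrow(P))
  d <- 0
  for (n in seq_len(n_max)) {
    An <- ((An %*% A) > 0) * 1      # 0/1 pattern of possible n-step moves
    if (An[i, i] > 0) d <- gcd(d, n)
  }
  if (d == 0) 1 else d              # empty return set: period 1 by convention
}

# Absorbing barriers at indices 1 and 5, steps up or down in between.
P <- matrix(0, 5, 5)
P[1, 1] <- 1
P[5, 5] <- 1
for (i in 2:4) {
  P[i, i + 1] <- 0.5
  P[i, i - 1] <- 0.5
}

periods <- sapply(1:5, function(i) period(P, i))
periods
```

Working with the 0/1 pattern instead of the probabilities themselves avoids any trouble with very small floating-point numbers.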
Communication and classes. A state \(i\) communicates with the state \(j\) if and only if \(j\geq i\). Therefore \(i\leftrightarrow j\) if and only if \(i=j\), and so, each \(i\in S\) is in a class by itself.
Closed sets. The closed sets are precisely the sets of the form \(B=\{i,i+1,i+2,\dots\}\), for \(i\in{\mathbb{N}}\).
Transience and recurrence. All states are transient.
Periodicity. The return set \(R(i)\) is empty for each \(i\in S\), so \(d(i)=1\), for all \(i\in S\).
Communication and classes. All the states except for those in \(E=\{ (40,Adv), (40,40), (Adv,40), \text{P1 wins}, \,\text{P2 wins}\}\) intercommunicate only with themselves, so each \(i\in S\setminus E\) is in a class by itself. The winning states P1 wins and P2 wins are absorbing, and, so, also form classes with one element. Finally, the three states in \(\{(40,Adv),(40,40),(Adv,40)\}\) intercommunicate with each other, so they form the last class.
Periodicity. The states \(i\) in \(S\setminus E\) have the property that \(p^{(n)}_{ii}=0\) for all \(n\in{\mathbb{N}}\), so \(d(i)=1\). The winning states are absorbing so \(d(i)=1\) for \(i\in \{\text{P1 wins, P2 wins}\}\). Finally, the return set for the remaining three states is \(\{2,4,6,\dots\}\) so their period is \(2\).
Let \(C_1\) and \(C_2\) be two (different) classes. For each of the following statements either explain why it is true, or give an example showing that it is false.
\(i{\rightarrow}j\) or \(j{\rightarrow}i\), for all \(i\in C_1\), and \(j\in C_2\),
\(C_1\cup C_2\) is not a class,
If \(i{\rightarrow}j\) for some \(i\in C_1\) and \(j\in C_2\), then \(k\not{\rightarrow}l\) for all \(k\in C_2\) and \(l\in C_1\),
If \(i{\rightarrow}j\) for some \(i\in C_1\) and \(j\in C_2\), then \(k{\rightarrow}l\) for some \(k\in C_2\) and \(l\in C_1\),
Consider a Markov Chain whose transition graph is given below (with orange edges having probability \(1/2\), black \(1\), blue \(3/4\) and green \(1/4\))
Identify the classes.
Find transient and recurrent states.
Find periods of all states.
Compute \(f^{(n)}_{13}\), for all \(n\in{\mathbb{N}}\), where \(f^{(n)}_{ij} ={\mathbb{P}}_i[T_j(1) = n]\).
Using software, we can get that, approximately, \[ P^{20}= \begin{pmatrix} 0 & 0 & 0.15 & 0.14 & 0.07 & 0.14 & 0.21 & 0.29 \\ 0 & 0 & 0.13 & 0.15 & 0.07 & 0.15 & 0.21 & 0.29 \\ 0 & 0 & 0.3 & 0.27 & 0.15 & 0.28 & 0 & 0 \\ 0 & 0 & 0.27 & 0.3 & 0.13 & 0.29 & 0 & 0 \\ 0 & 0 & 0.29 & 0.28 & 0.15 & 0.28 & 0 & 0 \\ 0 & 0 & 0.28 & 0.29 & 0.14 & 0.29 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0.43 & 0.57 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0.43 & 0.57 \end{pmatrix},\] where \(P\) is the transition matrix of the chain. Compute the probability \({\mathbb{P}}[X_{20}=3]\), if the initial distribution (the distribution of \(X_0\)) is given by \({\mathbb{P}}[X_0=1]=1/2\) and \({\mathbb{P}}[X_0=3]=1/2\).
A fair 6-sided die is rolled repeatedly, and for \(n\in{\mathbb{N}}\), the outcome of the \(n\)-th roll is denoted by \(Y_n\) (it is assumed that \(\{Y_n\}_{n\in{\mathbb{N}}}\) are independent of each other). For \(n\in{\mathbb{N}}_0\), let \(X_n\) be the remainder (taken in the set \(\{0,1,2,3,4\}\)) left after the sum \(\sum_{k=1}^n Y_k\) is divided by \(5\), i.e. \(X_0=0\), and \[X_n= \sum_{k=1}^n Y_k \ (\,\mathrm{mod}\, 5\,),\text{ for } n\in{\mathbb{N}},\] making \(\{X_n\}_{n\in {\mathbb{N}}_0}\) a Markov chain on the state space \(\{0,1,2,3,4\}\) (no need to prove this fact).
Write down the transition matrix of the chain, classify the states, separate recurrent from transient ones, and compute the period of each state.
Which of the following statements is true? Give a short explanation (or a counterexample where appropriate) for your choice. \(\{X_n\}_{n\in {\mathbb{N}}_0}\) is a Markov chain with state space \(S\).
If states \(i\) and \(j\) intercommunicate, then there exists \(n\in{\mathbb{N}}\) such that \(p^{(n)}_{ij}>0\) and \(p^{(n)}_{ji}>0\).
If all rows of the transition matrix are equal, then all states belong to the same class.
If \(P^n{\rightarrow}I\), then all states are recurrent.
(Note: We say that a sequence \(\{A_n\}_{n\in{\mathbb{N}}}\) of matrices converges to the matrix \(A\), and we denote it by \(A_n{\rightarrow}A\), if \((A_n)_{ij}{\rightarrow}A_{ij}\), as \(n{\rightarrow}\infty\), for all \(i,j\).)
Let \(C\) be a class in a Markov chain. For each of the following statements either explain why it is true, or give an example showing that it is false.
\(C\) is closed,
\(C^c\) is closed,
At least one state in \(C\) is recurrent,
For all states \(i,j\in C\), \(p_{ij}>0\),
Consider a Markov chain whose state space has \(n\) elements (\(n\in{\mathbb{N}}\)). For each of the following statements either explain why it is true, or give an example showing that it is false.
all classes are closed
at least one state is transient,
not more than half of all states are transient,
there are at most \(n\) classes,
Let \(i\) be a recurrent state with period 5, and let \(j\) be another state. For each of the following statements either explain why it is true, or give an example showing that it is false.
if \(j{\rightarrow}i\), then \(j\) is recurrent,
if \(j{\rightarrow}i\), then \(j\) has period \(5\),
if \(i{\rightarrow}j\), then \(j\) has period \(5\),
if \(j\not{\rightarrow}i\) then \(j\) is transient,
Let \(i\) and \(j\) be two states such that \(i\) is transient and \(i\leftrightarrow j\). For each of the following statements either explain why it is true, or give an example showing that it is false.
if \(i{\rightarrow}k\), then \(k\) is transient,
if \(k{\rightarrow}i\), then \(k\) is transient,
period of \(i\) must be \(1\),
(extra credit) \(\sum_{n=1}^{\infty} p^{(n)}_{jj} = \sum_{n=1}^{\infty} p^{(n)}_{ii}\),
Suppose there exists \(n\in{\mathbb{N}}\) such that \(P^n=I\), where \(I\) is the identity matrix and \(P\) is the transition matrix of a finite-state-space Markov chain. For each of the following statements either explain why it is true, or give an example showing that it is false.
\(P=I\).
All states belong to the same class.
All states are recurrent.
The period of each state is \(n\).
Suppose that all classes of a Markov chain are recurrent, and let \(i,j\) be two states such that \(i{\rightarrow}j\). For each of the 4 statements below, either explain why it is true, or give an example of a Markov chain in which it fails.
for each state \(k\), either \(i{\rightarrow}k\) or \(j{\rightarrow}k\)
\(j{\rightarrow}i\)
\(p_{ji}>0\) or \(p_{ij}>0\)
\(\sum_{n=1}^{\infty} p^{(n)}_{jj}<\infty\)
Caveat: From now on, all Markov chains will have finite state spaces.
Remember the “Tennis” example from a few lectures ago and the question we asked there, namely, how does the probability of winning a single point affect the probability of winning the overall game? An algorithm that will help you answer that question will be described in this lecture.
The first step is to understand the structure of the question asked in the light of the canonical decomposition of the previous lecture. In the “Tennis” example, all the states except for the winning ones are transient, and there are two one-element recurrent classes {“Player 1 wins”} and {“Player 2 wins”}. The chain starts from a transient state \((0,0)\), moves around a bit, and, eventually, gets absorbed in one of the two. The probability we are interested in is not the probability that the chain will eventually get absorbed. That probability is always \(1\). We are, instead, interested in the probability that the absorption will occur in a particular state - the state “Player 1 wins” (as opposed to “Player 2 wins”) in the “Tennis” example.
A more general version of the problem above is the following: let \(i\in S\) be any state, and let \(j\) be a recurrent state. If the set of all recurrent states is denoted by \(C\), and if \(\tau_{C}\) is the first hitting time of the set \(C\), then \(X_{\tau_{C}}\) denotes the first recurrent state visited by the chain. Equivalently, \(X_{\tau_{C}}\) is the value of \(X\) at (random) time \(\tau_{C}\); its value is the name of the state in which it happens to find itself the first time it hits the set of all recurrent states. For any two states \(i,j\in S\), the absorption probability \(u_{ij}\) is defined as \[u_{ij}={\mathbb{P}}_i[ X_{\tau_C}=j]={\mathbb{P}}_i[\text{ the first recurrent state visited by $X$ is $j$ }].\] There are several boring situations to discard first:
\(j\) is transient: in this case \(u_{ij}=0\) for any \(i\) because \(j\) cannot possibly be the first recurrent state we hit - it is not even recurrent.
\(j\) is recurrent, and so is \(i\). Since \(i\) is recurrent, i.e., \(i\in C\), we clearly have \(\tau_C=0\). Therefore \(u_{ij} = {\mathbb{P}}_i[ X_0= j]\), and this equals to either \(1\) or \(0\), depending on whether \(i=j\) or \(i\ne j\).
That leaves us with the situation where \(i \in T\) and \(j\in C\) as the interesting one. In many calculations related to Markov chains, the method of first-step decomposition works miracles. Simply, we cut the probability space according to what happened in the first step and use the law of total probability (assuming \(i\in T\), \(j\in C\)) \[\begin{split} u_{ij} & ={\mathbb{P}}_i[ X_{\tau_C}=j]=\sum_{k\in S} {\mathbb{P}}[X_{\tau_C}=j|X_0=i, X_1=k] {\mathbb{P}}[ X_1=k|X_0=i]\\ &= \sum_{k\in S} {\mathbb{P}}[X_{\tau_C}=j|X_1=k]p_{ik} \end{split}\] The conditional probability \({\mathbb{P}}[X_{\tau_C}=j|X_1=k]\) is an absorption probability, too. If \(k=j\), then \({\mathbb{P}}[X_{\tau_C}=j|X_1=k]=1\). If \(k\in C\setminus\{j\}\), then we are already in \(C\), but in a state different from \(j\), so \({\mathbb{P}}[ X_{\tau_C}=j|X_1=k]=0\). If \(k\in T\), the chain starts afresh from \(k\) (by the Markov property), so \({\mathbb{P}}[X_{\tau_C}=j|X_1=k]=u_{kj}\). Therefore, the sum above can be written as \[u_{ij}= \sum_{k\in T} p_{ik} u_{kj} + p_{ij},\] which is a system of linear equations for the family \(( u_{ij}, i\in T, j\in C)\). Linear systems are typically better understood when represented in the matrix form. Let \(U\) be a \(T\times C\)-matrix \(U=(u_{ij}, i\in T, j\in C)\), and let \(Q\) be the portion of the transition matrix \(P\) corresponding to the transitions from \(T\) to \(T\), i.e. \(Q=(p_{ij},i\in T, j\in T)\), and let \(R\) contain all transitions from \(T\) to \(C\), i.e., \(R=(p_{ij})_{i\in T, j\in C}\). If \(P_C\) denotes the matrix of all transitions from \(C\) to \(C\), i.e., \(P_C=(p_{ij}, i\in C, j\in C)\), then the canonical form of \(P\) looks like this: \[P= \begin{bmatrix} P_C & 0 \\ R & Q \end{bmatrix}.\] The system now becomes: \[U= QU+R,\text{ i.e., } (I-Q) U = R.\] If the matrix \(I-Q\) happens to be invertible, we are in business, because we then have an explicit expression for \(U\): \[U= (I-Q)^{-1} R.\] So, is \(I-Q\) invertible?
It is when the state space \(S\) is finite; here is the argument, in case you are interested:
Theorem. When the state space \(S\) is finite, the matrix \(I-Q\) is invertible and \[ \begin{split} (I-Q)^{-1} = \sum_{n=0}^{\infty} Q^n. \end{split}\] Moreover, the entry at the position \(i,j\) in \((I-Q)^{-1}\) is the expected total number of visits to the state \(j\), for a chain started at \(i\).
Proof. For \(k\in{\mathbb{N}}\), the matrix \(Q^k\) is the same as the submatrix corresponding to the transient states of the full \(k\)-step transition matrix \(P^k\). Indeed, going from a transient state to another transient state in \(k\) steps can only happen via other transient states (once we hit a recurrent class, we are stuck there forever).
Using the same idea as in the proof of our recurrence criterion in the previous chapter we can conclude that for any two transient states \(i\) and \(j\), we have (remember \({\mathbb{E}}_i[ \mathbf{1}_{\{X_n = j\}}] = {\mathbb{P}}_i[X_n = j] = p_{ij}^{(n)}\)) \[{\mathbb{E}}_i[ \sum_{n=0}^{\infty} \mathbf{1}_{\{X_n = j\}}] = \sum_{n\in{\mathbb{N}_0}} p^{(n)}_{ij} = \sum_{n\in{\mathbb{N}_0}} q^{(n)}_{ij} = (\sum_{n\in{\mathbb{N}}_0} Q^n)_{ij}.\] On the other hand, the left hand side above is simply the expected number of visits to the state \(j\), if we start from \(i\). Since both \(i\) and \(j\) are transient, this number will either be \(0\) (if the chain never even reaches \(j\) from \(i\)), or a geometric random variable (if it does). In either case, the expected value of this quantity is finite, and, so \[\sum_{n\in{\mathbb{N}}_0} q^{(n)}_{ij}<\infty.\] Therefore, the matrix sum \(F = \sum_{n\in{\mathbb{N}}_0} Q^n\) is well defined, and it remains to make sure that \(F = (I-Q)^{-1}\), which follows from the following simple computation: \[QF = Q \sum_{n\in{\mathbb{N}}_0} Q^n = \sum_{n\in{\mathbb{N}}_0} Q^{n+1} = \sum_{n\in{\mathbb{N}}} Q^n = \sum_{n\in{\mathbb{N}}_0} Q^n - I = F - I. \text{ Q.E.D.}\]
When the inverse \((I-Q)^{-1}\) exists (like in the finite case), it is called the fundamental matrix of the Markov chain.
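As a quick numerical sanity check of the theorem, the following sketch compares a long partial sum of the series \(\sum_n Q^n\) with the inverse computed by solve. The matrix Q used here is an illustrative assumption (a small \(2\times 2\) transient block), and partial_sum is a hypothetical helper written only for this check:

```r
# An illustrative 2x2 transient block (an assumption, not taken from the text):
# each row sums to less than 1, the missing mass leaks to the recurrent states.
Q <- matrix(c(0.00, 0.45,
              0.55, 0.00), nrow = 2, byrow = TRUE)

# Hypothetical helper: the partial sum I + Q + Q^2 + ... + Q^N
partial_sum <- function(Q, N) {
  S  <- diag(nrow(Q))  # running sum, starts at Q^0 = I
  Qn <- diag(nrow(Q))  # current power Q^n
  for (n in seq_len(N)) {
    Qn <- Qn %*% Q
    S  <- S + Qn
  }
  S
}

F_series <- partial_sum(Q, 200)    # truncated series
F_exact  <- solve(diag(2) - Q)     # (I - Q)^{-1}
max(abs(F_series - F_exact))       # should be negligibly small
```

The name F_series is used instead of F because F is also a shortcut for FALSE in R.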
Before we turn to the “Tennis” example, let us analyze a simpler case of Gambler’s ruin with \(a=3\).
What is the probability that a gambler coming in at \(x=\$1\) in a Gambler’s ruin problem with \(a=3\) succeeds in “getting rich”? We do not assume that \(p=\tfrac{1}{2}\).
The states \(0\) and \(3\) are absorbing, and all the others are transient. Therefore \(C_1=\{0\}\), \(C_2=\{3\}\) and \(T=T_1=\{1,2\}\). The transition matrix \(P\) in the canonical form (the rows and columns represent the states in the order \(0,3,1,2\)) is \[P= \begin{bmatrix} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 1-p & 0 & 0 & p\\ 0 & p & 1-p & 0 \end{bmatrix}\] Therefore, \[R= \begin{bmatrix} 1-p & 0 \\ 0 & p \end{bmatrix} \text{ and } Q= \begin{bmatrix} 0 & p \\ 1-p & 0 \end{bmatrix}.\] The matrix \(I-Q\) is a \(2\times 2\) matrix so it is easy to invert: \[(I-Q)^{-1}= \frac{1}{1-p+p^2}\begin{bmatrix} 1 & p \\ 1-p & 1 \end{bmatrix}.\] So \[U= \frac{1}{1-p+p^2}\begin{bmatrix} 1 & p \\ 1-p & 1 \end{bmatrix} \begin{bmatrix} 1-p & 0 \\ 0 & p \end{bmatrix} = \begin{bmatrix} \frac{1-p}{1-p+p^2} & \frac{p^2}{1-p+p^2} \\ \frac{(1-p)^2}{1-p+p^2} & \frac{p}{1-p+p^2} \\ \end{bmatrix}.\] Therefore, when the initial “wealth” is \(1\), the probability of getting rich before bankruptcy is \(u_{13}=p^2/(1-p+p^2)\) (the entry in the first row (\(x=1\)) and the second column (\(a=3\)) of \(U\)).
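The hand computation above can be double-checked numerically. The sketch below assumes an illustrative value \(p=0.45\) (the example itself leaves \(p\) general) and compares the entry of \(U\) returned by solve with the formula \(p^2/(1-p+p^2)\):

```r
p <- 0.45  # illustrative value (an assumption); any p in (0, 1) works

# Transient states in the order 1, 2; absorbing states in the order 0, 3
Q <- matrix(c(0,     p,
              1 - p, 0), nrow = 2, byrow = TRUE)
R <- matrix(c(1 - p, 0,
              0,     p), nrow = 2, byrow = TRUE)

U <- solve(diag(2) - Q, R)              # solves (I - Q) U = R

u13_numeric <- U[1, 2]                  # row 1 is x = 1, column 2 is a = 3
u13_formula <- p^2 / (1 - p + p^2)
c(u13_numeric, u13_formula)             # the two values should agree
rowSums(U)                              # each row should sum to 1
```

The rows of \(U\) sum to \(1\) because the chain is absorbed in either \(0\) or \(3\) with probability \(1\).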
Find the probability of winning a whole game of Tennis, for a player whose probability of winning a single rally is \(p=0.45\).
In the “Tennis” example, the transition matrix is \(20\times 20\), with only 2 recurrent states (each in its own class). In order to find the matrix \(U\), we (essentially) need to invert an \(18\times 18\) matrix and that is a job for a computer. We start with an R function which produces the transition matrix \(P\) as a function of the single-rally probability \(p\). Even though we only care about \(p=0.45\) here, the extra flexibility will come in handy soon:
S= c("0-0", "0-15", "15-0", "0-30", "15-15", "30-0", "0-40", "15-30",
"30-15", "40-0", "15-40", "30-30", "40-15", "40-30", "30-40",
"40-40", "40-A", "A-40", "P1", "P2")
# transition matrix of the tennis chain; rows and columns follow the order of S
tennis_P = function(p) {
matrix(c(
0,1-p,p,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,1-p,p,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,1-p,p,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,1-p,p,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,1-p,p,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,1-p,p,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,p,0,0,0,0,0,0,0,0,1-p,
0,0,0,0,0,0,0,0,0,0,1-p,p,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,1-p,p,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,1-p,0,0,0,0,0,p,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,p,0,0,0,0,1-p,
0,0,0,0,0,0,0,0,0,0,0,0,0,p,1-p,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,1-p,0,0,0,0,p,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1-p,0,0,p,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,p,0,0,0,1-p,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1-p,p,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,p,0,0,0,1-p,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1-p,0,0,p,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1),
byrow = TRUE, ncol = 20)
}
The position of the initial state “0-0” in the state-space vector
S is \(1\), and the
positions of the two absorbing states “P1” and “P2” are \(19\) and \(20\). Therefore the matrices \(Q\) and \(R\) are obtained by vector indexing as
follows:
P = tennis_P(0.45)
Q = P[1:18, 1:18]
R = P[1:18, 19:20]
Linear systems are solved by using the command solve in
R:
I = diag(18) # the identity matrix the same size as Q
U = solve(I - Q, R)
U[1, ]
## [1] 0.38 0.62
Therefore, the probability that Player 1 wins the entire game is
about \(0.377\). Note that this number
is smaller than \(0.45\), so it appears
that the game is designed to make it easier for the better player to
win. For more evidence, let’s draw the graph of this probability for
several values of \(p\)
(sapply is the version of apply for
vectors):
prob_win = function(p) {
if (p %in% c(0, 1))
return(p)
P = tennis_P(p)
Q = P[1:18, 1:18]
R = P[1:18, 19:20]
U = solve(diag(18) - Q, R)
U[1, 1]
}
ps = seq(0, 1, by = 0.01)
prob_game = sapply(ps, prob_win)
A graph of p vs. prob_game, where the
dashed line is the line \(y=x\) looks
like this:
Using a symbolic software package (like Mathematica) we can even get an explicit expression for the win probability in this case: \[\begin{align} u_{(0,0)\ \ "P1\ wins"} = p^4 + 4 p^4 q + 10 p^4 q^2 + \frac{20 p^5 q^3}{1-2pq}. \end{align}\] Actually, you don’t really need computers to derive the expression above. Can you do it by finding all the ways in which the game can be won in \(n=4,5,6,8, 10, 12, \dots\) rallies, computing their probabilities, and then adding them all up?
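The explicit expression is easy to evaluate directly, which gives an independent check of the matrix computation. In the sketch below, win_formula is a hypothetical name for a function that implements the displayed formula with \(q=1-p\):

```r
# Closed-form probability that Player 1 wins the game, with q = 1 - p
# (win_formula is a hypothetical name introduced for this check)
win_formula <- function(p) {
  q <- 1 - p
  p^4 + 4 * p^4 * q + 10 * p^4 * q^2 + 20 * p^5 * q^3 / (1 - 2 * p * q)
}

win_formula(0.45)                     # about 0.377, as in the matrix computation
win_formula(0.5)                      # evenly matched players
win_formula(0.3) + win_formula(0.7)   # the two players' chances add up to 1
```

Since the formula is written with vectorized arithmetic, win_formula can also be applied to a whole vector of values of \(p\) at once.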
Suppose that each time you visit a transient state \(j\in T\) you receive a reward \(g(j)\in{\mathbb{R}}\). The name “reward” is a bit misleading since a negative \(g(j)\) corresponds more to a fine than to a reward; it is just a name, anyway. Can we compute the expected total reward before absorption \[v_i={\mathbb{E}}_i[ \sum_{n=0}^{\tau_{C}-1} g(X_n)] ?\] And if we can, what is it good for? Many things, actually, as the following two special cases show:
If \(g(j)=1\) for all \(j\in T\), then \(v_i\) is the expected time until absorption. We will calculate \(v_{(0,0)}\) for the “Tennis” example to compute the expected duration of a tennis game.
If \(g(k)=1\) and \(g(j)=0\) for \(j\not =k\), then \(v_i\) is the expected number of visits to the state \(k\) before absorption. In the “Tennis” example, if \(k=(40,40)\), the value of \(v_{(0,0)}\) is the expected number of times the score \((40,40)\) is seen in a tennis game.
We compute \(v_i\) using the first-step decomposition: \[\begin{split} v_i &={\mathbb{E}}_i[ \sum_{n=0}^{\tau_C - 1} g(X_n)] = g(i)+ {\mathbb{E}}_i[ \sum_{n=1}^{\tau_C - 1} g(X_n)]\\ &= g(i)+ \sum_{k\in S} {\mathbb{E}}_i[ \sum_{n=1}^{\tau_C - 1} g(X_n)|X_1=k] {\mathbb{P}}_i[X_1=k]\\ & = g(i)+ \sum_{k\in S} p_{ik}{\mathbb{E}}_i[ \sum_{n=1}^{\tau_C - 1} g(X_n)|X_1=k] \end{split}\] If \(k\in T\), then the Markov property implies that \[{\mathbb{E}}_i[ \sum_{n=1}^{\tau_C - 1} g(X_n)|X_1=k]={\mathbb{E}}_k[ \sum_{n=0}^{\tau_C - 1} g(X_n)]=v_k.\] When \(k\not\in T\), we have \[{\mathbb{E}}_i[ \sum_{n=1}^{\tau_C - 1} g(X_n)|X_1=k]=0,\] because we have “arrived” and no more rewards are going to be collected. Therefore, for \(i\in T\) we have \[v_i=g(i)+\sum_{k\in T} p_{ik} v_k.\] If we organize all \(v_i\) and all \(g(i)\) into column vectors \(v=(v_i, i\in T)\), \(g=(g(i), i\in T)\), we get \[v=Qv+g, \text{ i.e., } v=(I-Q)^{-1} g = Fg.\]
Having derived the general formula for various rewards, we can provide another angle to the interpretation of the fundamental matrix \(F\). Let us pick a transient state \(j\) and use the reward function \(g\) given by \[g(k)=\mathbf{1}_{\{k=j\}}= \begin{cases} 1, & k=j \\ 0,& k\not= j. \end{cases}\] By the discussion above, the \(i^{th}\) entry in \(v=(I-Q)^{-1} g\) is the expected reward when we start from the state \(i\). Given the form of the reward function, \(v_i\) is the expected number of visits to the state \(j\) when we start from \(i\). On the other hand, as the product of the matrix \(F=(I-Q)^{-1}\) and the vector \(g=(0,0,\dots, 1, \dots, 0)\), \(v_i\) is nothing but the \((i,j)\)-entry in \(F=(I-Q)^{-1}\).
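Both special cases of the reward formula \(v=Fg\) can be tried out on a small chain. The sketch below uses the Gambler’s ruin chain with \(a=3\) and the illustrative choice \(p=1/2\) (an assumption made only to keep the numbers simple); the names Fmat, g_time and g_visit2 are introduced here for the illustration:

```r
p <- 0.5  # illustrative value (an assumption)

# Gambler's ruin with a = 3; transient states in the order 1, 2
Q <- matrix(c(0,     p,
              1 - p, 0), nrow = 2, byrow = TRUE)

# Fundamental matrix; called Fmat because F is a shortcut for FALSE in R
Fmat <- solve(diag(2) - Q)

# Reward 1 in every transient state: expected time until absorption
g_time <- c(1, 1)
v_time <- as.vector(Fmat %*% g_time)

# Reward 1 only in state 2: expected number of visits to state 2
g_visit2 <- c(0, 1)
v_visit2 <- as.vector(Fmat %*% g_visit2)

v_time      # expected number of steps from states 1 and 2
v_visit2    # coincides with the second column of Fmat
Fmat[, 2]
```

With \(g\equiv 1\) the vector \(v\) is just the vector of row sums of \(F\), and with an indicator reward it picks out a single column of \(F\).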
Let’s illustrate these ideas on some of our example chains:
What is the expected duration of a game of tennis? Compute it for several values of the parameter \(p\).
The main idea is to perform a reward computation with \(g(i)=1\) for all transient states \(i\). The R code is very similar to the one in the absorption example:
expected_duration = function(p) {
if (p %in% c(0, 1))
return(4)
P = tennis_P(p)
Q = P[1:18, 1:18]
g = matrix(1, nrow = 18, ncol = 1)
v = solve(diag(18) - Q, g)
v[1, ]
}
ps = seq(0, 1, by = 0.01)
duration_game = sapply(ps, expected_duration)
As above, here is the graph of p
vs. duration_game:
The maximum of the curve is about \(6.75\), and it is achieved when the players are evenly matched (\(p=0.5\)). Therefore, a game between evenly matched opponents lasts about \(6.75\) rallies on average. The game cannot be shorter than \(4\) rallies, and that is exactly the expected duration when one player wins each rally with certainty.
What is the expected number of “deuces”, i.e., scores \((40,40)\)? Compute it for several values of the parameter \(p\).
This can be computed exactly as above, except that now the reward
function is given by \[\begin{align}
g(i) = \begin{cases}
1, & \text{ if } i = (40,40),\\
0, & \text{ otherwise.}
\end{cases}
\end{align}\] Since the code is almost identical to the code from
the last example, we skip it here and only draw the graph:
As expected, there are no deuces when \(p=0\) or \(p=1\), and the maximal expected number of
deuces, namely \(0.625\), occurs when \(p=1/2\).
These numbers are a bit misleading, though, and, when asked, people would usually give a higher estimate for this expectation. The reason is that the expectation is a poor summary of the full distribution of the number of deuces. The best way to get a feeling for the entire distribution is to run some simulations. Here is the histogram of \(10000\) simulations of a game of tennis for the most interesting case \(p=0.5\):
We see that most of the games have no deuces. However, in the cases where a deuce does happen, it is quite possible it will be repeated. A sizable number of draws yielded \(4\) or more deuces.
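A simulation of this kind can be sketched as follows. This is not necessarily the code used to produce the histogram; simulate_deuces is a hypothetical helper that plays one game rally by rally and counts how many times the score \((40,40)\) is seen, and the seed is arbitrary:

```r
set.seed(1)  # arbitrary seed, used only to make the run reproducible

# Hypothetical helper: simulate one game and return the number of deuces.
# a and b count the rallies won so far by Player 1 and Player 2.
simulate_deuces <- function(p) {
  a <- 0
  b <- 0
  deuces <- 0
  repeat {
    if (runif(1) < p) a <- a + 1 else b <- b + 1
    # the score (40,40) corresponds to 3-3, 4-4, 5-5, ... rallies won
    if (a == b && a >= 3) deuces <- deuces + 1
    # the game ends once a player has at least 4 rallies and leads by 2
    if (max(a, b) >= 4 && abs(a - b) >= 2) break
  }
  deuces
}

n_deuces <- replicate(10000, simulate_deuces(0.5))
mean(n_deuces)                     # should be close to 0.625
table(n_deuces) / length(n_deuces) # empirical distribution of the count
```

A histogram of the simulated counts can then be drawn with, for example, barplot(table(n_deuces)).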
We end with another example from a different area:
Alice plays the following game. She picks a pattern consisting of three letters from the set \(\{H,T\}\), and then tosses a fair coin until her pattern appears for the first time. If she has to pay \(\$1\) for each coin toss, what is the expected cost she is going to incur? What pattern should she choose to minimize that cost?
We start by choosing a pattern, say \(HTH\), and computing the number of coin tosses Alice expects to make before it appears. This is just the kind of computation that can be done using our absorption-and-reward techniques, if we can find a suitable Markov chain. It turns out that the following will do (green arrows stand for probability \(1/2\)):
As Alice tosses the coin, she keeps track of the longest initial portion of her pattern that appears in the last several places of the sequence of past tosses. The state \(0\) represents no such portion (it is also the initial state), while \(HT\) means that the last two coin tosses were \(H\) and \(T\) (in that order) so that it is possible to end the game by tossing an \(H\) next. On the other hand, if the last toss was a \(T\) that did not follow an \(H\), there is no need to keep track of that: it is as good as \(0\).
Once we have this chain, all we have to do is perform the absorption and reward computation with the reward function \(g\equiv 1\). The \(Q\)-matrix of this chain (with the transient states ordered as \(0, H, HT\)) is \[Q = \begin{bmatrix} 1/2 & 1/2 & 0 \\ 0 & 1/2 & 1/2 \\ 1/2 & 0 & 0\\ \end{bmatrix}\] and the fundamental matrix \(F\) turns out to be \[F = \begin{bmatrix} 4 & 4 & 2 \\ 2 & 4 & 2 \\ 2 & 2 & 2 \\ \end{bmatrix} .\] Therefore, the required expectation is the sum of all the elements on the first row, i.e., \(10\).
Let us repeat the same for the pattern \(HHH\). We build a similar Markov chain:
We see that there is a subtle difference. One transition from the state \(H\), instead of going back to itself, is directed towards \(0\). It is clear from here, that this can only increase Alice’s cost. Indeed, the fundamental matrix is now given by \[F = \begin{bmatrix} 8 & 4 & 2 \\ 6 & 4 & 2 \\ 4 & 2 & 2 \\ \end{bmatrix},\] and the expected number of tosses before the first appearance of \(HHH\) comes out as \(14\).
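Both expectations can be reproduced with a few lines of R. In the sketch below, Q_HTH is the matrix given above, while Q_HHH is reconstructed from the description of the second chain (transient states ordered as \(0, H, HH\)), so it should be read as an assumption; expected_tosses is a hypothetical helper:

```r
# Transient part of the chain for HTH, states ordered as 0, H, HT
Q_HTH <- matrix(c(1/2, 1/2, 0,
                  0,   1/2, 1/2,
                  1/2, 0,   0), nrow = 3, byrow = TRUE)

# Transient part of the chain for HHH, states ordered as 0, H, HH
# (reconstructed from the description of the chain, hence an assumption)
Q_HHH <- matrix(c(1/2, 1/2, 0,
                  1/2, 0,   1/2,
                  1/2, 0,   0), nrow = 3, byrow = TRUE)

# Hypothetical helper: expected number of tosses, starting from the state 0
expected_tosses <- function(Q) {
  g <- rep(1, nrow(Q))               # reward 1 for each toss
  v <- solve(diag(nrow(Q)) - Q, g)   # solves (I - Q) v = g
  v[1]                               # the state 0 comes first
}

expected_tosses(Q_HTH)   # 10
expected_tosses(Q_HHH)   # 14
solve(diag(3) - Q_HHH)   # the fundamental matrix for HHH
```

The same helper can be reused for any other pattern once the corresponding \(Q\)-matrix is written down.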
Can you do this for other patterns? Which one should Alice choose to minimize her cost?
Note: do not use simulations in any of the problems below. Using R (or other software) to manipulate matrices or perform other numerical computation is fine.
The fundamental matrix associated to a finite Markov chain is \(F = \begin{bmatrix} 3 & 3 \\ 3/2 & 3\end{bmatrix}\), with the first row (and column) corresponding to the state \(A\) and the second to \(B\). Some of the following statements are true and the others are false. Find which ones are true and which are false; give explanations for your choices.
The chain has \(2\) recurrent states.
If the chain starts in \(A\), the expected number of visits to \(B\) before hitting the first recurrent state is \(3\).
If the chain is equally likely to start from \(A\) or \(B\), the expected number of steps it will take before it hits its first recurrent state is \(\frac{21}{4}\).
\({\mathbb{P}}_A[X_1=C] =0\) for any recurrent state \(C\).
In a Markov chain with a finite number of states, the fundamental matrix is given by \[F=\begin{bmatrix} 3 & 4 \\ \tfrac{3}{2} & 4\end{bmatrix}.\] The initial distribution of the chain is uniform on all transient states. Compute the expected value of \[\tau_C=\min \{ n\in{\mathbb{N}}_0\, : \, X_n\in C\},\] where \(C\) denotes the set of all recurrent states.
Consider the “Gambler’s ruin” model with parameter \(p\). Write an R function that computes the (vector of) probabilities that the gambler will go bankrupt before her wealth reaches \(\$1000\) for each initial wealth \(x = 0,1,\dots, 1000\). Plot the graphs for \(p=0.4, p=0.49, p=0.499\) and \(p=0.5\) on top of each other. Did you expect them to look the way they do?
A basketball player is shooting a series of free throws. The probability of hitting any single one is \(1/2\), and the throws are independent of each other. What is the expected number of throws the player will attempt before hitting 3 free throws in a row (including those 3)?
Let \(\{X_n\}_{n\in {\mathbb{N}}_0}\) be a Markov chain with the following transition matrix \[P= \begin{bmatrix} 1/2 & 1/2 & 0 \\ 1/3 & 1/3 & 1/3 \\ 0 & 0 & 1\\ \end{bmatrix}\] Suppose that the chain starts from the state \(1\).
What is the expected time that will pass before the chain first hits \(3\)?
What is the expected number of visits to state \(2\) before \(3\) is hit?
Would your answers to 1. and 2. change if we replaced values in the third row of \(P\) by any other values (as long as \(P\) remains a stochastic matrix)? Would \(1\) and \(2\) still be transient states?
Use the idea of part 3. to answer the following question. What is the expected number of visits to the state \(2\) before a Markov chain with transition matrix \[P= \begin{bmatrix} 17/20 & 1/20 & 1/10\\ 1/15 & 13/15 & 1/15\\ 2/5 & 4/15 & 1/3\\ \end{bmatrix}\] hits the state \(3\) for the first time (the initial state is still \(1\))? Remember this trick for your next exam.
A fair 6-sided die is rolled repeatedly, and for \(n\in{\mathbb{N}}\), the outcome of the \(n\)-th roll is denoted by \(Y_n\) (it is assumed that \(\{Y_n\}_{n\in{\mathbb{N}}}\) are independent of each other). For \(n\in{\mathbb{N}}_0\), let \(X_n\) be the remainder (taken in the set \(\{0,1,2,3,4\}\)) left after the sum \(\sum_{k=1}^n Y_k\) is divided by \(5\), i.e. \(X_0=0\), and \[\begin{split} X_n= \sum_{k=1}^n Y_k \ (\,\mathrm{mod}\, 5\,),\text{ for } n\in{\mathbb{N}}, \end{split}\] making \(\{X_n\}_{n\in {\mathbb{N}}_0}\) a Markov chain on the state space \(\{0,1,2,3,4\}\) (no need to prove this fact).
Write down the transition matrix of the chain.
Classify the states, separate recurrent from transient ones, and compute the period of each state.
Compute the expected number of rolls before the first time \(\{X_n\}_{n\in {\mathbb{N}}_0}\) visits the state \(2\), i.e., compute \({\mathbb{E}}[\tau_2]\), where \[\tau_2=\min \{ n\in{\mathbb{N}}_0\, : \, X_n=2\}.\]
Compute \({\mathbb{E}}[\sum_{k=0}^{\tau_2-1} X_k]\).
Let \(\{Y_n\}_{n\in {\mathbb{N}}_0}\) be a sequence of die-rolls, i.e., a sequence of independent random variables with distribution \[Y_n \sim \left( \begin{array}{cccccc} 1 & 2 & 3 & 4 & 5 & 6 \\ 1/6 & 1/6 & 1/6 & 1/6 & 1/6 & 1/6 \end{array} \right).\] Let \(\{X_n\}_{n\in {\mathbb{N}}_0}\) be a stochastic process defined by \(X_n=\max (Y_0,Y_1, \dots, Y_n)\). In words, \(X_n\) is the maximal value rolled so far.
Explain why \(X\) is a Markov chain, and find its transition matrix and the initial distribution.
Supposing that the first roll of the die was \(3\), i.e., \(X_0=3\), what is the expected time until a \(6\) is reached?
Under the same assumption as above (\(X_0=3\)), what is the probability that a \(5\) will not be rolled before a \(6\) is rolled for the first time?
Starting with the first value \(X_0=3\), each time a die is rolled, the current record (the value of \(X_n\)) is written down. When a \(6\) is rolled for the first time all the numbers are added up and the result is called \(S\) (the final \(6\) is not counted). What is the expected value of \(S\)?
Go back to the problem with Basil the rat in the first lecture on Markov chains and answer the question 2., but this time using an absorption/reward computation.
Go back to the problem with the professor and his umbrellas in the first lecture on Markov chains and answer the questions in part 2., but this time using an absorption/reward computation.
An airline reservation system has two computers. Any computer in operation may break down on any given day with probability \(p=0.3\), independently of the other computer. There is a single repair facility which takes two days to restore a computer to normal. It can work on only one computer at a time, and if two computers need work at the same time, one of them has to wait and enters the facility as soon as it is free again.
The system starts with one operational computer; the other one broke last night and just entered the repair facility this morning.
Compute the probability that at no time will both computers be down simultaneously between now and the first time both computers are operational.
Assuming that each day with only one working computer costs the company \(\$10,000\) and each day with both computers down \(\$30,000\), what is the total cost the company is expected to incur between now and the first time both computers are operational again?
it may interfere with your existing installation↩︎
be careful, though. The expression x = y is
not the same as x == y. It does not return a logical value
- it assigns the value of y to x↩︎
There are infinitely many ways random variables can be distributed. Indeed, in the discrete \({\mathbb N}\)-valued case only, any sequence of nonnegative numbers \((p_n)_n\) such that \(\sum_n p_n=1\) defines a probability distribution. It turns out, however, that a small-ish number of distributions appear in nature much more often than the rest. These distributions, like the normal, uniform, exponential, binomial, etc. turn out to be so important that they each get a name (hence named distributions). ↩︎
Some books will define the geometric random variables as the number of tosses (and not Ts) before the first H is obtained. In that case, the final H is included into the count. Clearly, this definition and the one we have given differ by \(1\), and this is really not a big deal, but you have to be careful about what is meant when a geometric random variable is mentioned.↩︎
The function sum adds up all the components
of the vector. You would not want such a function to be vectorized. If
it were, it would return exactly the same vector it got as input.↩︎
It is somewhat unfortunate that the standard notation
for the time horizon, namely \(T\),
coincides with a shortcut T for TRUE in R. Our
example still works fine because this shortcut is used only if there is
no variable named T.↩︎
The function apply is often used as a
substitute for a for loop because it has several advantages
over it. First, the code is much easier to read and understand. Second,
apply can easily be parallelized. Third, while this is not
such a big issue anymore, for loops used to be orders of
magnitude slower than the corresponding apply in the past.
R’s for loops got much better recently, but they still lag
behind apply in some cases. To be fair, apply
is known to use more memory than for in certain
cases.↩︎
For \(d=2\) we could have used the values “up”, “down”, “left” and “right”, for \(1,-1,2\) or \(-2\), respectively. In dimension \(3\), we could have added “forward” and “backward”, but we run out of words for directions for larger \(d\).↩︎
\(\binom{m}{i_1 \dots i_d}\) is called the multinomial coefficient. It counts the number of ways we can color \(m\) objects into one of \(d\) colors such that there are \(i_1\) objects of color \(1\), \(i_2\) of color \(2\), etc. It is a generalization of the binomial coefficient and its value is given by \[\binom{ m }{ i_1 i_2 \dots i_d} = \frac{m!}{i_1! i_2!\dots i_d!}.\]↩︎
Why is this identity true? Can you give a counting argument?↩︎